Results 1 - 10 of 156
High-performance complex event processing over streams
In SIGMOD, 2006.
"... In this paper, we present the design, implementation, and evaluation of a system that executes complex event queries over real-time streams of RFID readings encoded as events. These complex event queries filter and correlate events to match specific patterns, and transform the relevant events into n ..."
Abstract
-
Cited by 161 (7 self)
- Add to MetaCart
(Show Context)
In this paper, we present the design, implementation, and evaluation of a system that executes complex event queries over real-time streams of RFID readings encoded as events. These complex event queries filter and correlate events to match specific patterns, and transform the relevant events into new composite events for the use of external monitoring applications. Stream-based execution of these queries enables time-critical actions to be taken in environments such as supply chain management, surveillance and facility management, healthcare, etc. We first propose a complex event language that significantly extends existing event languages to meet the needs of a range of RFID-enabled monitoring applications. We then describe a query plan-based approach to efficiently implementing this language. Our approach uses native operators to efficiently handle query-defined sequences, which are a key component of complex event processing, and pipelines such sequences to subsequent operators that are built by leveraging relational techniques. We also develop a large suite of optimization techniques to address challenges such as large sliding windows and intermediate result sizes. We demonstrate the effectiveness of our approach through a detailed performance analysis of our prototype implementation as well as through a comparison to a state-of-the-art stream processor.
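
To make the sequence-operator idea concrete, here is a rough Python sketch of a SEQ matcher over a stream of events within a sliding window: it keeps partial matches alive, extends them when the next pattern symbol arrives for the same object, and expires them once they age out of the window. The Event fields, the brute-force partial-match list, and the skip-till-any-match policy are assumptions for readability; the paper's native operators use far more efficient structures.

    from collections import namedtuple

    # Assumed event shape: the paper's system consumes RFID readings; here an
    # event is just a type, an object id, and a timestamp.
    Event = namedtuple("Event", ["etype", "obj_id", "ts"])

    def seq_matches(stream, pattern, window):
        """Naive SEQ matcher (assumes len(pattern) >= 2): yields tuples of
        events whose types occur in pattern order, share one obj_id, and fall
        within `window` time units of each other."""
        partial = []  # (index of next symbol to match, [events matched so far])
        for ev in stream:
            new_partial = []
            for nxt, evs in partial:
                if ev.ts - evs[0].ts > window:
                    continue  # partial match aged out of the sliding window
                if ev.etype == pattern[nxt] and ev.obj_id == evs[0].obj_id:
                    matched = evs + [ev]
                    if nxt + 1 == len(pattern):
                        yield tuple(matched)
                    else:
                        new_partial.append((nxt + 1, matched))
                new_partial.append((nxt, evs))  # also keep waiting for later events
            partial = new_partial
            if ev.etype == pattern[0]:
                partial.append((1, [ev]))  # start a fresh partial match

    # Example: an object read at a shelf and then at an exit within 60 ticks.
    events = [Event("SHELF", 42, 0), Event("EXIT", 42, 30), Event("SHELF", 7, 40)]
    for m in seq_matches(iter(events), ["SHELF", "EXIT"], window=60):
        print(m)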
Fault-tolerance in the Borealis distributed stream processing system
In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data, 2005.
"... Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e.g., in network security monitorin ..."
Abstract
-
Cited by 97 (9 self)
- Add to MetaCart
(Show Context)
Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e.g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this paper, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE. Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any
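
The availability knob DPC exposes can be sketched in a few lines of Python: hold tuples back while an upstream input is missing, but never longer than the user's delay threshold, after which results go out marked tentative and are corrected once the input returns. Class and method names are illustrative assumptions; the real protocol also reconciles state across replicas.

    from collections import deque

    class DelayProcessCorrect:
        """Toy sketch of DPC's delay-then-process idea, not the full protocol."""
        def __init__(self, delay_threshold):
            self.delay_threshold = delay_threshold
            self.pending = deque()  # (arrival_time, tuple) awaiting processing

        def on_tuple(self, t, now, upstream_alive):
            self.pending.append((now, t))
            return self.drain(now, upstream_alive)

        def drain(self, now, upstream_alive):
            out = []
            while self.pending:
                arrived, t = self.pending[0]
                if upstream_alive:
                    self.pending.popleft()
                    out.append(("STABLE", t))     # all inputs present: exact result
                elif now - arrived >= self.delay_threshold:
                    self.pending.popleft()
                    out.append(("TENTATIVE", t))  # availability bound reached
                else:
                    break  # still within the delay budget; keep waiting
            return out

A downstream client would later receive correction messages that replace the tentative results; that reconciliation step is omitted here.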
Contract-Based Load Management in Federated Distributed Systems
In NSDI Symposium, 2004.
"... This paper focuses on load management in looselycoupled federated distributed systems. We present a distributed mechanism for moving load between autonomous participants using bilateral contracts that are negotiated offline and that set bounded prices for moving load. We show that our mechanism has ..."
Abstract
-
Cited by 87 (8 self)
- Add to MetaCart
(Show Context)
This paper focuses on load management in loosely coupled federated distributed systems. We present a distributed mechanism for moving load between autonomous participants using bilateral contracts that are negotiated offline and that set bounded prices for moving load. We show that our mechanism has good incentive properties, efficiently redistributes excess load, and has a low overhead in practice.
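
A hypothetical sketch of how bounded prices drive offloading: a node compares its own marginal processing cost against the fixed unit prices in its offline-negotiated contracts and greedily moves load to the cheapest partners while that remains cheaper than processing locally. The convex cost model and all names are assumptions, not the paper's exact formulation.

    def marginal_cost(load, capacity):
        # Assumed convex cost: processing gets steeply more expensive near capacity.
        return 1.0 / max(1e-9, capacity - load)

    def offload_plan(load, capacity, contracts):
        """contracts: (partner, unit_price, max_units) tuples negotiated offline.
        Returns (plan, residual_load)."""
        plan = []
        for partner, price, max_units in sorted(contracts, key=lambda c: c[1]):
            moved = 0
            while moved < max_units and load > 0 and marginal_cost(load, capacity) > price:
                load -= 1   # moving one unit is cheaper than processing it here
                moved += 1
            if moved:
                plan.append((partner, moved))
        return plan, load

    # A nearly saturated node sheds load to its cheap partner only.
    print(offload_plan(load=95, capacity=100, contracts=[("B", 0.05, 20), ("C", 0.5, 50)]))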
Communication-efficient distributed monitoring of thresholded counts
In Proc. of SIGMOD’06, 2006.
"... Monitoring is an issue of primary concern in current and next gen-eration networked systems. For example, the objective of sensor networks is to monitor their surroundings for a variety of differ-ent applications like atmospheric conditions, wildlife behavior, and troop movements among others. Simil ..."
Abstract
-
Cited by 78 (11 self)
- Add to MetaCart
(Show Context)
Monitoring is an issue of primary concern in current and next-generation networked systems. For example, the objective of sensor networks is to monitor their surroundings for a variety of different applications like atmospheric conditions, wildlife behavior, and troop movements among others. Similarly, monitoring in data networks is critical not only for accounting and management, but also for detecting anomalies and attacks. Such monitoring applications are inherently continuous and distributed, and must be designed to minimize the communication overhead that they introduce. In this context we introduce and study a fundamental class of problems called “thresholded counts” where we must return the aggregate frequency count of an event that is continuously monitored by distributed nodes with a user-specified accuracy whenever the actual count exceeds a given threshold value. In this paper we propose to address the problem of thresholded counts by setting local thresholds at each monitoring node and initiating communication only when the locally observed data exceeds these local thresholds. We explore algorithms in two categories: static thresholds and adaptive thresholds. In the static case, we consider thresholds based on a linear combination of two alternate strategies, and show that there exists an optimal blend of the two strategies that results in minimum communication overhead. We further show that this optimal blend can be found using a steepest descent search. In the adaptive case, we propose algorithms that adjust the local thresholds based on the observed distributions of updated information in the distributed monitoring system. We use extensive simulations not only to verify the accuracy of our algorithms and validate our theoretical results, but also to evaluate the performance of the two approaches. We find that both approaches yield significant savings over the naive approach of performing processing at a centralized location.
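
The core idea, staying silent until a locally held threshold is crossed, can be sketched as follows. The coordinator's policy for splitting slack and refreshing thresholds is a simplified stand-in (the even-split rule and all names are assumptions); the paper's static blend with steepest-descent search and its adaptive algorithms are more involved.

    class Coordinator:
        def __init__(self, T, num_nodes):
            self.T, self.n = T, num_nodes
            self.last_reported = {}             # node_id -> last reported count

        def initial_threshold(self):
            return max(1, self.T // self.n)     # even split of the slack (assumed policy)

        def report(self, node_id, count):
            self.last_reported[node_id] = count
            total = sum(self.last_reported.values())
            if total >= self.T:
                print(f"ALERT: global count {total} >= threshold {self.T}")
                return count + 1                # past the threshold: report every update
            return count + max(1, (self.T - total) // self.n)  # hand back fresh slack

    class Node:
        def __init__(self, coordinator, node_id):
            self.coord, self.id, self.count = coordinator, node_id, 0
            self.local_threshold = coordinator.initial_threshold()

        def observe(self, n=1):
            self.count += n
            if self.count >= self.local_threshold:  # only now do we communicate
                self.local_threshold = self.coord.report(self.id, self.count)

    coord = Coordinator(T=100, num_nodes=4)
    node = Node(coord, 0)
    for _ in range(30):
        node.observe()   # 30 local events, but only one message (at count 25) is sent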
High-availability algorithms for distributed stream processing
2004.
"... Stream-processing systems are designed to support an emerging class of applications that require sophisticated and timely processing of high-volume data streams, often originating in distributed environments. Unlike traditional dataprocessing applications that require precise recovery for correctnes ..."
Abstract
-
Cited by 74 (15 self)
- Add to MetaCart
(Show Context)
Stream-processing systems are designed to support an emerging class of applications that require sophisticated and timely processing of high-volume data streams, often originating in distributed environments. Unlike traditional data-processing applications that require precise recovery for correctness, many stream-processing applications can tolerate and benefit from weaker recovery guarantees. In this paper, we study various recovery guarantees and pertinent recovery techniques that can meet the correctness and performance requirements of stream-processing applications. We discuss the design and algorithmic challenges associated with the proposed recovery techniques and describe how each can provide different guarantees with proper combinations of redundant processing, checkpointing, and remote logging. Using analysis and simulations, we quantify the cost of our recovery guarantees and examine the performance and applicability of the recovery techniques. We also analyze how the knowledge of query network properties can help decrease the cost of high availability.
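
One of the building blocks the paper combines, checkpointing plus upstream output buffering, can be sketched as follows: the primary periodically ships its state to a standby and trims the upstream buffer up to the checkpointed sequence number, so recovery replays only the un-checkpointed suffix. Class names and the word-count state are illustrative assumptions; the paper studies when such schemes yield precise versus weaker recovery guarantees.

    import copy

    class Upstream:
        def __init__(self):
            self.buffer = []                    # (seq, word) not yet covered by a checkpoint
        def send(self, t):
            self.buffer.append(t)
            return t
        def trim(self, upto_seq):
            self.buffer = [t for t in self.buffer if t[0] > upto_seq]

    class Primary:
        def __init__(self, upstream):
            self.state, self.seq, self.up = {}, 0, upstream
        def process(self, t):
            seq, word = t
            self.state[word] = self.state.get(word, 0) + 1   # running word count
            self.seq = seq
        def checkpoint(self, standby):
            standby.install(copy.deepcopy(self.state), self.seq)
            self.up.trim(self.seq)              # input up to seq is now safe to discard

    class Standby:
        def install(self, state, seq):
            self.state, self.seq = state, seq
        def recover(self, upstream):
            for seq, word in upstream.buffer:   # replay the suffix: precise recovery
                self.state[word] = self.state.get(word, 0) + 1
            return self.state

    up, standby = Upstream(), Standby()
    primary = Primary(up)
    primary.process(up.send((1, "a"))); primary.process(up.send((2, "b")))
    primary.checkpoint(standby)                 # standby holds {a:1, b:1}; buffer trimmed
    primary.process(up.send((3, "a")))          # arrives after the checkpoint
    print(standby.recover(up))                  # {'a': 2, 'b': 1}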
Processing flows of information: from data stream to complex event processing
ACM Computing Surveys, 2011.
"... A large number of distributed applications requires continuous and timely processing of information as it flows from the periphery to the center of the system. Examples include intrusion detection systems which analyze network traffic in real-time to identify possible attacks; environmental monitori ..."
Abstract
-
Cited by 67 (11 self)
- Add to MetaCart
A large number of distributed applications require continuous and timely processing of information as it flows from the periphery to the center of the system. Examples include intrusion detection systems, which analyze network traffic in real time to identify possible attacks; environmental monitoring applications, which process raw data coming from sensor networks to identify critical situations; and applications performing online analysis of stock prices to identify trends and forecast future values. Traditional DBMSs, which need to store and index data before processing it, can hardly fulfill the requirements of timeliness coming from such domains. Accordingly, during the last decade, different research communities developed a number of tools, which we collectively call Information flow processing (IFP) systems, to support these scenarios. They differ in their system architecture, data model, rule model, and rule language. In this article, we survey these systems to help researchers, who often come from different backgrounds, in understanding how the various approaches they adopt may complement each other. In particular, we propose a general, unifying model to capture the different aspects of an IFP system and use it to provide a complete and precise classification of the systems and mechanisms proposed so far.
Network-Aware Query Processing for Stream-based Applications
2004.
"... This paper investigates the benefits of network awareness when processing queries in widelydistributed environments such as the Internet. ..."
Abstract
-
Cited by 61 (0 self)
- Add to MetaCart
This paper investigates the benefits of network awareness when processing queries in widely distributed environments such as the Internet.
EP-SPARQL: A Unified Language for Event Processing and Stream Reasoning
2011.
"... Streams of events appear increasingly today in various Web applications such as blogs, feeds, sensor data streams, geospatial information, on-line financial data, etc. Event Processing (EP) is concerned with timely detection of compound events within streams of simple events. State-of-the-art EP pro ..."
Abstract
-
Cited by 56 (2 self)
- Add to MetaCart
Streams of events increasingly appear in various Web applications such as blogs, feeds, sensor data streams, geospatial information, and online financial data. Event Processing (EP) is concerned with the timely detection of compound events within streams of simple events. State-of-the-art EP provides on-the-fly analysis of event streams, but cannot combine streams with background knowledge and cannot perform reasoning tasks. On the other hand, semantic tools can effectively handle background knowledge and perform reasoning thereon, but cannot deal with rapidly changing data provided by event streams. To bridge the gap, we propose Event Processing SPARQL (EP-SPARQL) as a new language for complex events and Stream Reasoning. We provide the syntax and formal semantics of the language and devise an effective execution model for the proposed formalism. The execution model is grounded in logic programming, and features effective event processing and inferencing capabilities over temporal and static knowledge. We provide an open-source prototype implementation and present a set of tests to show the usefulness and effectiveness of our approach.
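
To make the SEQ semantics concrete without inventing EP-SPARQL syntax, here is a Python sketch that pairs a match of one triple pattern with a strictly later match of another on a shared subject. The single join variable, the keep-earliest policy, and the stock-market example are assumptions for illustration; EP-SPARQL itself compiles such patterns into logic programs.

    def seq_join(stream, pattern_a, pattern_b):
        """stream: iterable of (subject, predicate, object, timestamp) triples.
        pattern_a/pattern_b: (predicate, object) pairs; the subject acts as the
        join variable. Yields (a, b) with a strictly earlier than b."""
        earliest_a = {}                          # subject -> earliest A-match
        for s, p, o, ts in stream:
            if (p, o) == pattern_a and s not in earliest_a:
                earliest_a[s] = (s, p, o, ts)
            elif (p, o) == pattern_b and s in earliest_a:
                a = earliest_a[s]
                if a[3] < ts:
                    yield (a, (s, p, o, ts))

    stream = [("acme", "rating", "downgraded", 1),
              ("acme", "price", "dropped", 5)]
    for a, b in seq_join(stream, ("rating", "downgraded"), ("price", "dropped")):
        print(a, "SEQ", b)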
Towards correcting input data errors probabilistically using integrity constraints
In MobiDB, 2006.
"... Mobile and pervasive applications frequently rely on devices such as RFID antennas or sensors (light, temperature, motion) to provide them information about the physical world. These devices, however, are unreliable. They produce streams of information where portions of data may be missing, duplicat ..."
Abstract
-
Cited by 51 (9 self)
- Add to MetaCart
(Show Context)
Mobile and pervasive applications frequently rely on devices such as RFID antennas or sensors (light, temperature, motion) to provide them with information about the physical world. These devices, however, are unreliable. They produce streams of information where portions of data may be missing, duplicated, or erroneous. The current state of the art is to correct errors locally (e.g., range constraints for temperature readings) or to use spatial/temporal correlations (e.g., smoothing temperature readings). However, errors are often apparent only in a global setting, e.g., missed readings of objects that are known to be present, or exit readings from a parking garage without matching entry readings. In this paper, we present StreamClean, a system for correcting input data errors automatically using application-defined global integrity constraints. Because it is frequently impossible to make corrections with certainty, we propose a probabilistic approach, where the system assigns to each input tuple the probability that it is correct. We show that StreamClean handles a large class of input data errors, and corrects them sufficiently fast to keep up with the input rates of many mobile and pervasive applications. We also show that the probabilities assigned by StreamClean correspond to a user’s intuitive notion of correctness.
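
A toy version of the approach: when a global constraint says an object should have been read but no reading arrived, insert a probabilistic tuple instead of asserting a correction outright. The read-rate probability model below is an assumption for illustration only; StreamClean derives tuple probabilities from the constraints themselves via probabilistic inference.

    def clean_readings(observed_ids, manifest, read_rate):
        """Patch missed RFID readings against a shelf manifest (the integrity
        constraint: every object on the manifest should produce a reading).
        Returns (object_id, probability-of-presence) pairs."""
        cleaned = [(obj, 1.0) for obj in observed_ids]          # observed: certain
        for obj in manifest - set(observed_ids):
            cleaned.append((obj, read_rate))                    # likely present, missed
        return cleaned

    # tag2 is on the manifest but was not read; it enters the stream with p = 0.9.
    print(clean_readings({"tag1", "tag3"}, {"tag1", "tag2", "tag3"}, read_rate=0.9))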
Discretized Streams: Fault-tolerant streaming computation at scale
In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), 2013.
"... Many “big data ” applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requir ..."
Abstract
-
Cited by 45 (6 self)
- Add to MetaCart
(Show Context)
Many “big data” applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.
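
Since D-Streams shipped as Spark Streaming, the model is easiest to see through its classic DStream API: a program is a series of short, deterministic batch computations whose lineage enables parallel recovery. Below is a minimal word-count sketch; the host, port, and checkpoint path are placeholders.

    # Minimal word count over a socket stream with the classic DStream API.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "DStreamWordCount")
    ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches
    ssc.checkpoint("/tmp/dstream-checkpoints")    # enables lineage-based recovery

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))  # per-batch word counts

    counts.pprint()
    ssc.start()
    ssc.awaitTermination()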