Data Streams: Algorithms and Applications
, 2005
Cited by 533 (22 self)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
Adaptive cleaning for RFID data streams
, 2006
Cited by 101 (0 self)
To compensate for the inherent unreliability of RFID data streams, most RFID middleware systems employ a "smoothing filter", a sliding-window aggregate that interpolates for lost readings. In this paper, we propose SMURF, the first declarative, adaptive smoothing filter for RFID data cleaning. SMURF models the unreliability of RFID readings by viewing RFID streams as a statistical sample of tags in the physical world, and exploits techniques grounded in sampling theory to drive its cleaning processes. Through the use of tools such as binomial sampling and π-estimators, SMURF continuously adapts the smoothing window size in a principled manner to provide accurate RFID data to applications.
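The binomial-sampling view of a smoothing window can be illustrated with a small calculation. The sketch below is not taken from the abstract and is not claimed to be SMURF's actual rule: it assumes each read cycle ("epoch") detects a present tag independently with probability `read_rate`, and asks how many epochs a window needs so that a present tag is missed in every epoch with probability at most `delta`. The function names and parameters are hypothetical.

```python
import math


def miss_probability(read_rate: float, window: int) -> float:
    """Probability that a present tag is read in none of `window` epochs,
    assuming independent reads with per-epoch success probability `read_rate`."""
    return (1.0 - read_rate) ** window


def min_window_epochs(read_rate: float, delta: float) -> int:
    """Smallest-effort window bound (hypothetical helper): since
    (1 - p)^w <= exp(-p * w), choosing w >= ln(1/delta) / p guarantees
    that the miss probability is at most delta."""
    if not 0.0 < read_rate <= 1.0:
        raise ValueError("read_rate must be in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.ceil(math.log(1.0 / delta) / read_rate)


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.1):
        w = min_window_epochs(p, delta=0.05)
        print(f"read rate {p}: window of {w} epochs, "
              f"miss probability {miss_probability(p, w):.4f}")
```

The bound makes the trade-off in the abstract concrete: a lower observed read rate calls for a longer window, which is why a fixed window size cannot suit every tag and reader.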
What’s different: Distributed, continuous monitoring of duplicate-resilient aggregates on data streams
- In ICDE
, 2006
Cited by 29 (8 self)
Emerging applications in sensor systems and network-wide IP traffic analysis present many technical challenges. They need distributed monitoring and continuous tracking of events. They have severe resource constraints not only at each site in terms of per-update processing time and archival space for high-speed streams of observations, but also, crucially, communication constraints for collaborating on the monitoring task. These elements have been addressed in a series of recent works. A fundamental issue that arises is that one cannot make the “uniqueness” assumption on observed events which is present in previous works, since wide-scale monitoring invariably encounters the same events at different points. For example, within the network of an Internet Service Provider, packets of the same flow will be observed in different routers; similarly, the same individual will be observed by multiple mobile sensors in monitoring wild animals. Aggregates of interest on such distributed environments must be resilient to duplicate observations. We study such duplicate-resilient aggregates that measure the extent of the duplication (how many unique observations there are, how many observations are unique), as well as standard holistic aggregates such as quantiles and heavy hitters over the unique items. We present accuracy-guaranteed, highly communication-efficient algorithms for these aggregates that work within the time and space constraints of high-speed streams. We also present results of a detailed experimental study on both real-life and synthetic data.
Continuous Analytics Over Discontinuous Streams
Cited by 19 (1 self)
Continuous analytics systems that enable query processing over streams of data have emerged as key solutions for dealing with massive data volumes and demands for low latency. These systems have been heavily influenced by an assumption that data streams can be viewed as sequences of data that arrived more or less in order. The reality, however, is that streams are not often so well behaved and disruptions of various sorts are endemic. We argue, therefore, that stream processing needs a fundamental rethink and advocate a unified approach toward continuous analytics over discontinuous streaming data. Our approach is based on a simple insight – using techniques inspired by data parallel query processing, queries can be performed over independent sub-streams with arbitrary time ranges in parallel, generating partial results. The consolidation of the partial results over each sub-stream can then be deferred to the time at which the results are actually used on an on-demand basis. In this paper, we describe how the Truviso Continuous Analytics system implements this type of order-independent processing. Not only does the approach provide the first real solution to the problem of processing streaming data that arrives arbitrarily late, it also serves as a critical building block for solutions to a host of hard problems such as parallelism, recovery, transactional consistency, high availability, failover, and replication.
Categories and Subject Descriptors: H.2.4 [Database Management]: Systems – query processing, parallel databases, transaction processing.
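The idea of computing partial results over independent sub-streams and consolidating them only on demand can be sketched as follows. This is a minimal illustration under stated assumptions, not the Truviso implementation: events are `(timestamp, value)` pairs, the aggregate is a per-time-bucket count and sum, and `partial_aggregate`, `consolidate` and `bucket_width` are hypothetical names.

```python
from collections import defaultdict


def partial_aggregate(events, bucket_width):
    """Aggregate one sub-stream on its own, in whatever order it arrives.
    Returns {bucket: [count, total]} keyed by time bucket."""
    partial = defaultdict(lambda: [0, 0.0])
    for timestamp, value in events:
        bucket = timestamp // bucket_width
        partial[bucket][0] += 1
        partial[bucket][1] += value
    return dict(partial)


def consolidate(partials):
    """Merge partial results at query time. Count and sum are commutative
    and associative, so the merge order does not affect the answer."""
    merged = defaultdict(lambda: [0, 0.0])
    for partial in partials:
        for bucket, (count, total) in partial.items():
            merged[bucket][0] += count
            merged[bucket][1] += total
    return {bucket: (count, total)
            for bucket, (count, total) in sorted(merged.items())}


if __name__ == "__main__":
    on_time = partial_aggregate([(5, 1.0), (12, 2.0)], bucket_width=10)
    late = partial_aggregate([(3, 4.0)], bucket_width=10)  # arrives much later
    print(consolidate([on_time, late]))
```

Because late data only ever produces another partial result, nothing already computed has to be revised; the cost is that each query pays for the merge.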
Time decaying aggregates in out-of-order streams
, 2007
Cited by 14 (4 self)
Processing large data streams is now a major topic in data management. The data involved can be truly massive, and the required analyses complex. In a stream of sequential events such as stock feeds, sensor readings, or IP traffic measurements, data tuples pertaining to recent events are typically more important than older ones. This can be formalized via time-decay functions, which assign weights to data based on the age of data. Decay functions such as sliding windows and exponential decay have been studied under the assumption of well-ordered arrivals, i.e., data arrives in non-decreasing order of time stamps. However, data quality issues are prevalent in massive streams (due to network asynchrony and delays etc.), and correct arrival order is not guaranteed. We focus on the computation of decayed aggregates such as range queries, quantiles, and heavy hitters on out-of-order streams, where elements do not necessarily arrive in increasing order of timestamps. Existing techniques such as Exponential Histograms and Waves are unable to handle out-of-order streams. We give the first deterministic algorithms for approximating these aggregates under popular decay functions such as sliding window and polynomial decay. We study the overhead of allowing out-of-order arrivals when compared to well-ordered arrivals, both analytically and experimentally. Our experiments confirm that these algorithms can be applied in practice, and compare the relative performance of different approaches for handling out-of-order arrivals.
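As a point of reference for the decay functions named above, the sketch below computes an exact decayed sum by storing every item. It is only a baseline for intuition: the algorithms in the paper approximate such aggregates in small space, which this code does not attempt. The specific forms of the weights (a window of length `window`, `exp(-lam * age)`, and `(age + 1) ** -alpha`) are common textbook choices assumed here, and all function names are hypothetical.

```python
import math


def sliding_window_weight(age, window):
    """Weight 1 for items younger than `window`, 0 otherwise."""
    return 1.0 if 0 <= age < window else 0.0


def exponential_weight(age, lam):
    """Weight exp(-lam * age); items from the future get weight 0."""
    return math.exp(-lam * age) if age >= 0 else 0.0


def polynomial_weight(age, alpha):
    """Weight (age + 1)^(-alpha); items from the future get weight 0."""
    return (age + 1.0) ** (-alpha) if age >= 0 else 0.0


def decayed_sum(items, now, weight):
    """Exact decayed sum over (timestamp, value) pairs.
    Each weight depends only on the item's own timestamp, so the
    arrival order of `items` is irrelevant to the result."""
    return sum(value * weight(now - timestamp) for timestamp, value in items)


if __name__ == "__main__":
    out_of_order = [(3, 4.0), (1, 2.0), (2, 1.0)]  # timestamps not increasing
    print(decayed_sum(out_of_order, now=4,
                      weight=lambda age: sliding_window_weight(age, 2)))
    print(decayed_sum(out_of_order, now=4,
                      weight=lambda age: polynomial_weight(age, 1.0)))
```

Keeping every item makes out-of-order arrival trivial here; the difficulty the abstract refers to arises once the summary must be compressed.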
Incremental View-Based Analysis of Stock Market Data Streams
"... behrend,dorau,manthey,schuelle ..."
Declarative support for . . .
Cited by 4 (0 self)
Pervasive applications rely on data captured from the physical world through sensor devices. Data provided by these devices, however, tend to be unreliable. The data must, therefore, be cleaned before an application can make use of them, leading to additional complexity for application development and deployment. Here we present Extensible Sensor stream Processing (ESP), a framework for building sensor data cleaning infrastructures for use in pervasive applications. ESP is designed as a pipeline using declarative cleaning mechanisms based on spatial and temporal characteristics of sensor data. We demonstrate ESP’s effectiveness and ease of use through three real-world scenarios.
Towards an advanced system for real-time event detection in high-volume data streams
- In Proceedings of the 5th workshop for Ph.D. students on Information and Knowledge Management, PIKM ’12, ACM
, 2012
Cited by 3 (2 self)
This paper presents an advanced system for real-time event detection in high-volume data streams. Our main goal is to provide a system that can handle high-volume data streams and is able to detect events in real time. Additionally, we perform further steps, such as classifying and ranking events with retrospective analysis. To solve this task we take advantage of a high-performance database system for semi-structured data and extend it with the functionality of continuous querying. The combination of executing queries on the incoming data stream and fast queries on the historical datasets is used as a powerful tool for developing an event detection and information system. Furthermore, we define several event features for improving event classification and for discovering parallelisms, relations, duration, and coherences of events.
Scalable Linked Data Stream Processing via Network-Aware Workload Scheduling
Cited by 1 (0 self)
In order to cope with the ever-increasing data volume, distributed stream processing systems have been proposed. To ensure scalability, most distributed systems partition the data and distribute the workload among multiple machines. This approach does, however, raise the question of how the data and the workload should be partitioned and distributed. A uniform scheduling strategy (a uniform distribution of computation load among available machines), typically used by stream processing systems, disregards network load as one of the major bottlenecks for throughput, resulting in an immense load in terms of inter-machine communication. In this paper we propose a graph-partitioning-based approach for workload scheduling within stream processing systems. We implemented a distributed triple-stream processing engine on top of the Storm realtime computation framework and evaluate its communication behavior using two real-world datasets. We show that the application of graph partitioning algorithms can decrease inter-machine communication substantially (by 40% to 99%) whilst maintaining an even workload distribution, even using very limited data statistics. We also find that processing RDF data as single triples at a time, rather than as graph fragments (containing multiple triples), may decrease throughput, indicating the usefulness of semantics.
Quality-driven evaluation of trigger conditions on streaming time series
, 2004
Cited by 1 (0 self)
For many applications, it is important to evaluate trigger conditions on time series streams. In a resource-constrained environment, users’ needs should ultimately decide how the evaluation system balances the competing factors such as evaluation speed, result precision, and load shedding level. This paper presents a basic framework for evaluation algorithms that takes user-specified quality requirements into consideration. Three optimization algorithms, each under a different set of quality requirements, are developed in the framework: (1) minimize the response time given accuracy requirements and without load shedding; (2) minimize the load shedding given a response time limit and accuracy requirements; and (3) minimize one type of accuracy errors given a response time limit and without load shedding. Experiments show that these optimization algorithms effectively achieve their optimization goals while satisfying the corresponding user-specified quality requirements.