Results 1 - 10 of 14
On predictability of system anomalies in real world
- In 18th Annual Meeting of the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
, 2010
Cited by 7 (1 self)
Abstract—As computer systems become increasingly complex, system anomalies have become major concerns in system management. In this paper, we present a comprehensive measurement study to quantify the predictability of different system anomalies. Online anomaly prediction allows the system to foresee impending anomalies so as to take proper actions to mitigate their impact. Our anomaly prediction approach combines feature value prediction with statistical classification methods. We conduct an extensive measurement study to investigate the anomalous behavior of three systems in the real world: PlanetLab, SMART hard drive data, and IBM System S. We observe that real-world system anomalies do exhibit predictability and can be predicted with high accuracy and significant lead time.
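The combination the abstract describes can be sketched as a minimal, hypothetical pipeline. The extrapolation window, the Gaussian threshold, and the constant k = 3 below are illustrative assumptions, not the paper's actual predictor or classifier:

```python
# Illustrative sketch (not the paper's method): online anomaly prediction
# combining feature value prediction (linear extrapolation over a sliding
# window) with a simple statistical classifier (a threshold learned from
# normal-behavior samples).
from statistics import mean, stdev

def predict_next(history, window=5):
    """Predict the next metric value by linear extrapolation over the
    last `window` samples."""
    recent = history[-window:]
    if len(recent) < 2:
        return recent[-1]
    # average first difference as the slope estimate
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    return recent[-1] + slope

def train_classifier(normal_samples, k=3.0):
    """Learn an upper anomaly threshold: mean + k standard deviations."""
    return mean(normal_samples) + k * stdev(normal_samples)

def predict_anomaly(history, threshold):
    """Flag an anomaly if the *predicted* next value crosses the
    threshold, giving lead time before the anomaly actually occurs."""
    return predict_next(history) > threshold

# Usage: CPU utilisation trending upward toward saturation.
threshold = train_classifier([20, 22, 19, 21, 20, 23, 18])
print(predict_anomaly([20, 35, 50, 65, 80], threshold))  # True: trend crosses threshold
```

The lead time comes from acting on the extrapolated value rather than waiting for the raw metric to breach the threshold.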
On the use of computational geometry to detect software faults at runtime
Cited by 6 (2 self)
Despite advances in software engineering, software faults continue to cause system downtime. Software faults are difficult to detect before the system fails, especially since the first symptom of a fault is often system failure itself. This paper presents a computational geometry technique and a supporting tool to tackle the problem of timely fault detection during the execution of a software application. The approach involves collecting a variety of runtime measurements and building a geometric enclosure, such as a convex hull, which represents the normal (i.e., non-failing) operating space of the application being monitored. When collected runtime measurements are classified as being outside of the enclosure, the application is considered to be in an anomalous (i.e., failing) state. This paper presents experimental results that illustrate the advantages of using a computational geometry approach over distance-based approaches using the Chi-squared and Mahalanobis distances. Additionally, we present results illustrating the advantages of using the convex-hull enclosure for fault detection over a simpler enclosure such as a hyperrectangle.
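A minimal two-dimensional sketch of the enclosure idea follows. The monotone-chain construction and the (cpu%, mem%) features are illustrative assumptions; the paper's tool works with richer measurements and higher dimensions:

```python
# Illustrative 2-D sketch: build a convex hull from measurements taken
# during normal operation, then flag any new measurement that falls
# outside the hull as anomalous.
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone-chain convex hull, returned counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, p):
    """A point lies inside a CCW convex polygon iff it is on the left of
    (or on) every edge."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

# Normal operating space from (cpu%, mem%) samples; probe two new samples.
normal = [(10, 20), (80, 25), (70, 90), (15, 85), (40, 50)]
hull = convex_hull(normal)
print(inside_hull(hull, (40, 40)))   # True  -> normal
print(inside_hull(hull, (95, 95)))   # False -> anomalous
```

The hyperrectangle alternative the abstract mentions would simply check per-axis min/max bounds, which admits corner regions the application never actually visited.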
FixMe: A Self-organizing Isolated Anomaly Detection Architecture for Large Scale Distributed Systems
- In Proceedings of the 16th International Conference On Principles Of Distributed Systems (OPODIS)
, 2012
Cited by 6 (2 self)
Abstract. Monitoring a system means collecting and analyzing relevant information provided by the monitored devices so as to remain continuously aware of the system state. However, the ever-growing complexity and scale of systems make both real-time monitoring and fault detection tedious tasks. Thus the usually adopted option is to focus solely on a subset of information states, so as to provide coarse-grained indicators. As a consequence, detecting isolated failures or anomalies is a challenging issue. In this work, we propose to address this issue by pushing the monitoring task to the edge of the network. We present a peer-to-peer based architecture, which enables nodes to adaptively and efficiently self-organize according to their “health” indicators. By exploiting both the temporal and spatial correlations that exist between a device and its vicinity, our approach guarantees that only isolated anomalies (an anomaly is isolated if it impacts solely one monitored device) are reported on the fly to the network operator. We show that the end-to-end detection process, i.e., from local detection to reporting to the management operator, requires a number of messages logarithmic in the size of the network.
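The notion of an isolated anomaly can be illustrated with a toy predicate (assumed semantics for illustration only, not the FixMe protocol itself):

```python
# Illustrative sketch: exploit spatial correlation so that only
# *isolated* anomalies reach the operator. A device reports its own
# anomaly only when its overlay neighbors are healthy; a shared anomaly
# suggests a global event that coarse-grained monitoring catches anyway.
def is_isolated_anomaly(node_healthy, neighbors_healthy):
    """Report only if this node is anomalous while all neighbors are healthy."""
    return (not node_healthy) and all(neighbors_healthy)

# One failing node among healthy neighbors -> reported.
print(is_isolated_anomaly(False, [True, True, True]))    # True
# Whole neighborhood failing -> suppressed (not isolated).
print(is_isolated_anomaly(False, [False, True, False]))  # False
```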
OLIC: OnLine Information Compression for Scalable Hosting Infrastructure Monitoring
Cited by 4 (1 self)
Abstract—Quality-of-service (QoS) management often requires a continuous monitoring service to provide updated information about different hosts and network links in the managed system. However, it is a challenging task to achieve both scalability and precision for monitoring various intra-node and inter-node metrics (e.g., CPU, memory, disk, network delay) in a large-scale hosting infrastructure. In this paper, we present a novel OnLine Information Compression (OLIC) system to achieve scalable fine-grained hosting infrastructure monitoring. OLIC models continuous snapshots of a hosting infrastructure as a sequence of images and performs online monitoring data compression to significantly reduce the monitoring cost. We have implemented a prototype of the OLIC system and deployed it on PlanetLab and NCSU’s virtual computing lab (VCL). We have conducted extensive experiments using a set of real monitoring data from VCL, PlanetLab, and a Google cluster as well as a real Internet traffic matrix trace. The experimental results show that OLIC can achieve much higher compression ratios with several orders of magnitude less overhead than previous approaches.
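The snapshots-as-images analogy can be sketched as simple inter-frame delta encoding (an assumption for illustration; OLIC's actual codec is not reproduced here):

```python
# Illustrative sketch: treat each monitoring snapshot as a "frame" of
# (node, metric) -> value entries and transmit only the deltas that
# exceed a small error bound, instead of the full frame.
def encode_delta(prev_frame, frame, tolerance=0.5):
    """Return sparse updates {(node, metric): value} for entries that
    drifted more than `tolerance` since the last transmitted frame."""
    updates = {}
    for key, value in frame.items():
        if abs(value - prev_frame.get(key, 0.0)) > tolerance:
            updates[key] = value
    return updates

def decode_delta(prev_frame, updates):
    """The management node reconstructs the frame from the sparse updates."""
    frame = dict(prev_frame)
    frame.update(updates)
    return frame

prev = {("node1", "cpu"): 40.0, ("node1", "mem"): 60.0, ("node2", "cpu"): 10.0}
curr = {("node1", "cpu"): 40.2, ("node1", "mem"): 75.0, ("node2", "cpu"): 10.1}
updates = encode_delta(prev, curr)
print(updates)                      # only ("node1", "mem") changed enough
print(decode_delta(prev, updates))  # reconstruction within the tolerance
```

The tolerance trades reconstruction precision for bandwidth, which mirrors the scalability/precision tension the abstract describes.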
Decentralized Online Clustering for Supporting Autonomic Management of Distributed Systems (abstract of the dissertation)
, 2010
Distributed computational infrastructures, as well as the applications and services that they support, are increasingly becoming an integral part of society and affecting every aspect of life. As a result, ensuring their efficient and robust operation is critical. However, the scale and overall complexity of these systems is growing at an alarming rate (current data centers contain tens to hundreds of thousands of computing and storage devices running complex applications), making the management of these systems extremely challenging and rapidly exceeding human capability. The large quantities of distributed system data, in the form of user and component interaction and status events, contain meaningful information that can be used to infer the states of different components or of the system as a whole. Accurate and timely knowledge of these states is essential for verifying the correctness and efficiency of the operation of the system, as well as for discovering specific situations of interest, such as anomalies or faults, that require the application of appropriate management actions. Autonomic systems/applications must therefore be able to effectively process the large amounts of distributed data and to characterize operational states in a robust,
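The online clustering named in the title can be sketched with a generic one-pass, leader-style clusterer (an assumed stand-in for illustration, not the dissertation's algorithm):

```python
# Illustrative sketch: one-pass (online) clustering of incoming status
# vectors. Each point joins the nearest existing cluster within `radius`,
# or starts a new cluster; past data is never revisited.
import math

def online_cluster(stream, radius):
    """Return a list of [centroid, count] clusters built in one pass."""
    clusters = []
    for point in stream:
        best, best_d = None, radius
        for c in clusters:
            d = math.dist(point, c[0])
            if d <= best_d:
                best, best_d = c, d
        if best is None:
            clusters.append([list(point), 1])
        else:
            # incremental (running-mean) centroid update
            n = best[1] + 1
            best[0] = [(m * best[1] + x) / n for m, x in zip(best[0], point)]
            best[1] = n
    return clusters

# Two tight groups of node-status vectors (cpu%, mem%): a healthy group
# and an overloaded group emerge as two clusters.
stream = [(10, 10), (11, 9), (90, 85), (9, 11), (88, 87)]
clusters = online_cluster(stream, radius=10)
print(len(clusters))  # 2
```

A decentralized variant would run this per node and merge nearby centroids between peers, which is the kind of state characterization the abstract motivates.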
Resilient Self-Compressive Monitoring for Large-Scale Hosting Infrastructures
Abstract—Large-scale hosting infrastructures have become the fundamental platforms for many real-world systems such as cloud computing infrastructures, enterprise data centers, and massive data processing systems. However, it is a challenging task to achieve both scalability and high precision while monitoring a large number of intra-node and inter-node attributes (e.g., CPU usage, free memory, free disk, inter-node network delay). In this paper, we present the design and implementation of a Resilient self-Compressive Monitoring (RCM) system for large-scale hosting infrastructures. RCM achieves scalable distributed monitoring by performing online data compression to reduce remote data collection cost. RCM provides failure resilience to achieve robust monitoring for dynamic distributed systems where host and network failures are common. We have conducted extensive experiments using a set of real monitoring data from NCSU’s virtual computing lab (VCL), PlanetLab, a Google cluster, and real Internet traffic matrices. The experimental results show that RCM can achieve up to 200% higher compression ratio and several orders of magnitude less overhead than existing approaches. Index Terms—Online data compression, distributed system monitoring
Thus, we can calculate
Without compression, a distributed monitoring system needs to configure all the monitoring agents to periodically report their collected attribute values to the management node. Let
Enhanced Monitoring-as-a-Service for Effective Cloud
- In IEEE Transactions on Computers
Abstract—This paper introduces the concept of monitoring-as-a-service (MaaS), its main components, and a suite of key functional requirements for MaaS in the Cloud. We argue that MaaS should support not only conventional state monitoring capabilities, such as instantaneous violation detection, periodic state monitoring, and single-tenant monitoring, but also performance-enhanced functionalities that optimize monitoring cost, scalability, and the effectiveness of monitoring service consolidation and isolation. In this paper we present three enhanced MaaS capabilities and show that window-based state monitoring is not only more resilient to noise and outliers, but also saves considerable communication cost. Similarly, violation-likelihood based state monitoring can dynamically adjust monitoring intensity based on the likelihood of detecting important events, leading to significant gains in monitoring service consolidation. Finally, multi-tenancy support in state monitoring allows multiple Cloud users to enjoy MaaS with improved performance and efficiency at more affordable cost. We perform extensive experiments in an emulated Cloud environment with real-world system and network traces. The experimental results suggest that our MaaS framework achieves significantly lower monitoring cost, higher scalability, and better multi-tenancy performance.
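Window-based state monitoring can be sketched as follows (assumed semantics for illustration: a violation is raised only after w consecutive samples breach the threshold, rather than on every instantaneous crossing):

```python
# Illustrative sketch: window-based state monitoring. A transient spike
# never completes a run of length w, so noise and outliers are filtered;
# only sustained breaches are reported.
def window_violations(samples, threshold, w):
    """Return the indices at which a violation run of length `w` completes."""
    violations, run = [], 0
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run >= w:
            violations.append(i)
    return violations

samples = [50, 95, 40, 96, 97, 98, 45]  # one spike, one sustained breach
print(window_violations(samples, threshold=90, w=3))  # [5]: only the sustained run
```

Instantaneous monitoring would have alerted at indices 1, 3, 4, and 5; the window variant reports once, which is the noise resilience and communication saving the abstract claims.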
Self-Compressive Approach for Distributed System Monitoring
Large-scale distributed hosting infrastructures have become the basic platforms for several real-world production systems. However, it is a challenging task to achieve both scalability and high precision while monitoring a large number of intra-node attributes, which carry information about each node, and inter-node attributes, which denote measurements between different nodes. This paper presents a new distributed monitoring framework based on video coding techniques, named RBOIC (Replica Based Online Information