Results 1 - 10 of 64
Capturing, Indexing, Clustering, and Retrieving System History.
- In ACM Symposium on Operating Systems Principles (SOSP), 2005
Abstract - Cited by 120 (8 self)
We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) as well as a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24 × 7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.
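The signature idea above can be sketched in a few lines: reduce each system state to a vector over a set of attributed metrics, then retrieve the nearest previously labeled state. This is a minimal illustration, not the paper's method; the metric names, thresholds, and labels are hypothetical, and the paper derives signatures by statistical modeling rather than fixed thresholds.

```python
import math

def signature(state, thresholds):
    # Distill a raw metric reading into a binary "abnormal metric" vector.
    return [1.0 if state[m] > thresholds[m] else 0.0 for m in sorted(thresholds)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_sig, archive):
    # Similarity-based retrieval: nearest previously observed labeled state.
    return min(archive, key=lambda entry: distance(entry[1], query_sig))

# Hypothetical per-metric abnormality thresholds (the paper instead attributes
# metrics to SLO violations via statistical models).
thresholds = {"cpu": 0.8, "latency_ms": 200, "disk_io": 0.9}

archive = [
    ("db_overload",   signature({"cpu": 0.95, "latency_ms": 450, "disk_io": 0.20}, thresholds)),
    ("io_saturation", signature({"cpu": 0.30, "latency_ms": 300, "disk_io": 0.97}, thresholds)),
]

query = signature({"cpu": 0.90, "latency_ms": 500, "disk_io": 0.10}, thresholds)
label, _ = retrieve(query, archive)
```

A real deployment would also cluster the archived signatures so that recurring problems form identifiable groups rather than isolated matches.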
Research challenges of autonomic computing
- In Proc. of Int. Conf. on Software Engineering
Abstract - Cited by 88 (0 self)
Autonomic computing is a grand-challenge vision of the future in which computing systems will manage themselves in accordance with high-level objectives specified by humans. The IT industry recognizes that meeting this challenge is imperative; otherwise, IT systems will soon become virtually impossible to administer. But meeting this challenge is also extremely difficult, and will require a worldwide collaboration among the best minds of academia and industry. In the hope of motivating researchers in relevant areas to apply their expertise to this vitally important problem, I outline some of the main scientific and engineering challenges that collectively make up the grand challenge of autonomic computing, and ...
Fingerprinting the datacenter: Automated classification of performance crises
- In Proceedings of EuroSys’10, 2010
Abstract - Cited by 51 (1 self)
Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We propose and evaluate a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach is based on a new and efficient representation of the datacenter’s state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. Our evaluation uses 4 months of trouble-ticket data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is ...
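A rough sketch of the fingerprint representation follows. The quantile boundaries, metric names, and exact-match lookup are illustrative assumptions; the paper selects metrics statistically and matches fingerprints by similarity rather than equality.

```python
def quantize(value, low, high):
    # Map a raw metric to cold (-1), normal (0), or hot (+1) relative to
    # hypothetical quantile boundaries.
    if value <= low:
        return -1
    if value >= high:
        return 1
    return 0

def fingerprint(metrics, quantiles):
    # Summarize a crisis epoch as a vector of quantized metric levels.
    return tuple(quantize(metrics[m], *quantiles[m]) for m in sorted(quantiles))

# Made-up (low, high) boundaries per metric.
quantiles = {"cpu": (0.2, 0.9), "mem": (0.3, 0.85), "latency_ms": (50, 400)}

# Archive of previously diagnosed crises keyed by fingerprint.
known = {
    fingerprint({"cpu": 0.95, "mem": 0.50, "latency_ms": 600}, quantiles): "crisis_A",
}

current = fingerprint({"cpu": 0.97, "mem": 0.40, "latency_ms": 700}, quantiles)
match = known.get(current, "unseen crisis")
```

The quantized vector is what makes two crises with slightly different raw values map to the same identity.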
VCONF: a reinforcement learning approach to virtual machines auto-configuration
- In ICAC, 2009
Abstract - Cited by 48 (18 self)
Virtual machine (VM) technology enables multiple VMs to share resources on the same host. Resources allocated to the VMs should be re-configured dynamically in response to changes in application demands or resource supply. Because VM execution involves the privileged domain and the VM monitor, this causes uncertainties in the VMs’ resource-to-performance mapping and poses challenges in the online determination of appropriate VM configurations. In this paper, we propose a reinforcement learning (RL) based approach, namely VCONF, to automate the VM configuration process. VCONF employs model-based RL algorithms to address the scalability and adaptability issues in applying RL to real system management. Experimental results on both controlled environments and a testbed imitating production systems with Xen VMs and representative server workloads demonstrate the effectiveness of VCONF. The approach is able to find optimal (or near-optimal) configurations in small-scale systems and shows good adaptability and scalability.
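The RL formulation can be illustrated with a deliberately simplified, stateless bandit-style sketch: candidate configurations are the actions, measured performance is the reward, and a value table is updated online with epsilon-greedy exploration. VCONF itself uses model-based RL over a real state space; the configurations and reward values below are made up.

```python
import random

random.seed(0)  # deterministic run for the sketch

# Hypothetical action space: candidate vCPU allocations for one VM.
configs = [1, 2, 3, 4]

def measured_reward(c):
    # Stand-in for observed application throughput under configuration c.
    return {1: 0.2, 2: 0.6, 3: 1.0, 4: 0.7}[c]

# Value table and epsilon-greedy online updates.
q = {c: 0.0 for c in configs}
alpha, epsilon = 0.5, 0.2
for _ in range(200):
    if random.random() < epsilon:
        c = random.choice(configs)      # explore a random configuration
    else:
        c = max(q, key=q.get)           # exploit the best known one
    q[c] += alpha * (measured_reward(c) - q[c])

best = max(q, key=q.get)                # settles on the top-reward config
```

Model-based variants, as the paper argues, reduce the number of such live trials by learning the reward surface instead of sampling it exhaustively.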
SCADS: Scale-Independent Storage for Social Computing Applications
Abstract - Cited by 38 (5 self)
Collaborative web applications such as Facebook, Flickr and Yelp present new challenges for storing and querying large amounts of data. As users and developers are focused more on performance than single-copy consistency or the ability to perform ad-hoc queries, there exists an opportunity for a highly-scalable system tailored specifically for relaxed consistency and pre-computed queries. The Web 2.0 development model demands the ability to both rapidly deploy new features and automatically scale with the number of users. There have been many successful distributed key-value stores, but so far none provide as rich a query language as SQL. We propose a new architecture, SCADS, that allows the developer to declaratively state application-specific consistency requirements, takes advantage of utility computing to provide cost-effective scale-up and scale-down, and will use machine learning models to introspectively anticipate performance problems and predict the resource requirements of new queries before execution.
Automated performance analysis of load tests
- In IEEE International Conference on Software Maintenance (ICSM), 2009
Abstract - Cited by 30 (16 self)
The goal of a load test is to uncover functional and performance problems of a system under load. Performance problems refer to the situations where a system suffers from unexpectedly high response time or low throughput. It is difficult to detect performance problems in a load test due to the absence of formally-defined performance objectives and the large amount of data that must be examined. In this paper, we present an approach which automatically analyzes the execution logs of a load test for performance problems. We first derive the system’s performance baseline from previous runs. Then we perform an in-depth performance comparison against the derived performance baseline. Case studies show that our approach produces few false alarms (with a precision of 77%) and scales well to large industrial systems.
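The two-step scheme described above, deriving a baseline from previous runs and then comparing a new run against it, might look like this in outline. The event names, numbers, and the k-sigma deviation rule are illustrative, not the paper's actual comparison.

```python
import statistics

def derive_baseline(past_runs):
    # Per-event (mean, stdev) response-time baseline from previous runs.
    return {event: (statistics.mean(values), statistics.stdev(values))
            for event, values in past_runs.items()}

def flag_anomalies(baseline, new_run, k=3.0):
    # Flag events deviating more than k standard deviations from the baseline.
    flagged = []
    for event, value in new_run.items():
        mean, sd = baseline[event]
        if sd > 0 and abs(value - mean) > k * sd:
            flagged.append(event)
    return flagged

# Invented response times (ms) per log event across four previous runs.
past = {"login": [100, 110, 105, 95], "search": [200, 210, 190, 205]}
baseline = derive_baseline(past)
alarms = flag_anomalies(baseline, {"login": 104, "search": 900})
```

Deriving the threshold from history rather than a fixed objective is what lets the approach work without formally-defined performance targets.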
Short Term Performance Forecasting in Enterprise Systems
- In ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2005
Abstract - Cited by 30 (0 self)
Keywords: IT system performance, forecasting algorithms, time series analysis, Bayesian networks.
We use data mining and machine learning techniques to predict upcoming periods of high utilization or poor performance in enterprise systems. The objective is to automate the assignment of resources to stabilize performance (e.g., adding servers to a cluster) or opportunistic job scheduling (e.g., backups or virus scans). Two factors make this problem suitable for data mining techniques. First, there is abundant data, given the state of current commercial monitoring and data collection tools for enterprise systems. Second, the complexity of these systems defies human characterization or static models. We formulate the problem as classification: given current and past information about the system's ...
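The classification formulation can be sketched as follows: build a feature vector from current and lagged utilization measurements and predict a binary label for the next interval. The features, training data, and 1-nearest-neighbor rule here are stand-ins, not the paper's models.

```python
def features(window):
    # Feature vector: the three most recent utilization measurements.
    return tuple(window[-3:])

# Invented training windows labeled by what the *next* interval looked like.
history = [
    (features([0.2, 0.3, 0.2]), 0),  # next interval stayed calm
    (features([0.5, 0.7, 0.9]), 1),  # next interval was overloaded
]

def predict(window):
    # 1-nearest-neighbor over past labeled windows.
    f = features(window)
    def dist(g):
        return sum((a - b) ** 2 for a, b in zip(f, g))
    return min(history, key=lambda item: dist(item[0]))[1]

alert = predict([0.4, 0.6, 0.8])
```

Labeling windows by the following interval's state is what turns a forecasting problem into ordinary supervised classification.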
Online measurement of the capacity of multi-tier websites using hardware performance counters
, 2007
Abstract - Cited by 18 (11 self)
Understanding server capacity is crucial for system capacity planning, configuration, and QoS-aware resource management. Conventional stress testing approaches measure the server capacity in terms of application-level performance metrics like response time and throughput. They are limited in measurement accuracy and timeliness. In a multi-tier website, the resource bottleneck often shifts between tiers as the client access pattern changes. This makes capacity measurement even more challenging. This paper presents a measurement approach based on hardware performance counter metrics. The approach uses machine learning techniques to infer application-level performance at each tier. A coordinated predictor is induced over individual tier models to estimate system-wide performance and identify the bottleneck when the system becomes overloaded. Experimental results demonstrate that this approach is able to achieve an overload prediction accuracy of higher than 90% for a priori known input traffic patterns and over 85% accuracy even for traffic causing frequent bottleneck shifting. It costs less than 0.5% runtime overhead for data collection and no more than 50 ms for each on-line decision.
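One way to picture the coordinated predictor: a per-tier model maps hardware counter readings to an overload probability, and the tier-level estimates are combined to flag overload and name the bottleneck. The logistic form, counter names, and coefficients below are assumptions for illustration; the paper induces the tier models with machine learning.

```python
import math

def tier_overload_prob(counters, weights, bias):
    # Per-tier logistic score over hardware counter readings.
    score = sum(counters[name] * w for name, w in weights.items()) + bias
    return 1.0 / (1.0 + math.exp(-score))

# Made-up normalized counter readings and coefficients for a two-tier site.
weights, bias = {"cache_misses": 2.0, "instructions": 1.0}, -2.0
probs = {
    "web": tier_overload_prob({"cache_misses": 0.2, "instructions": 0.5}, weights, bias),
    "db":  tier_overload_prob({"cache_misses": 0.9, "instructions": 0.8}, weights, bias),
}

# Coordinated prediction: flag overload and name the most loaded tier.
bottleneck = max(probs, key=probs.get)
overloaded = probs[bottleneck] > 0.5
```

Combining per-tier estimates rather than one global model is what lets the predictor localize a bottleneck as it shifts between tiers.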
Adaptive system anomaly prediction for large-scale hosting infrastructures
- In PODC, 2010
Abstract - Cited by 17 (6 self)
Large-scale hosting infrastructures require automatic system anomaly management to achieve continuous system operation. In this paper, we present a novel adaptive runtime anomaly prediction system, called ALERT, to achieve robust hosting infrastructures. In contrast to traditional anomaly detection schemes, ALERT aims at raising advance anomaly alerts to achieve just-in-time anomaly prevention. We propose a novel context-aware anomaly prediction scheme to improve prediction accuracy in dynamic hosting infrastructures. We have implemented the ALERT system and deployed it on several production hosting infrastructures such as the IBM System S stream processing cluster and PlanetLab. Our experiments show that ALERT can achieve high prediction accuracy for a range of system anomalies and impose low overhead on the hosting infrastructure.
Online Anomaly Prediction for Robust Cluster Systems
- In Proc. of ICDE, 2009
Abstract - Cited by 17 (7 self)
In this paper, we present a stream-based mining algorithm for online anomaly prediction. Many real-world applications, such as data stream analysis, require continuous cluster operation. Unfortunately, today’s large-scale cluster systems are still vulnerable to various software and hardware problems. System administrators are often overwhelmed by the tasks of correcting various system anomalies such as processing bottlenecks (i.e., full stream buffers), resource hot spots, and service level objective (SLO) violations. Our anomaly prediction scheme raises early alerts for impending system anomalies and suggests possible anomaly causes. Specifically, we employ Bayesian classification methods to capture different anomaly symptoms and infer anomaly causes. Markov models are introduced to capture the changing patterns of different measurement metrics. More importantly, our scheme combines Markov models and Bayesian classification methods to predict when a system anomaly will appear in the foreseeable future and what the possible anomaly causes are. To the best of our knowledge, our work provides the first stream-based mining algorithm for predicting system anomalies. We have implemented our approach within the IBM System S distributed stream processing cluster, and conducted case study experiments using fully implemented distributed data analysis applications processing real application workloads. Our experiments show that our approach efficiently predicts and diagnoses several bottleneck anomalies with high accuracy while imposing low overhead on the cluster system.
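A toy rendering of the Markov-plus-Bayesian combination: transition counts over discretized metric levels predict the next state, and historical co-occurrence counts suggest a likely cause for the predicted anomaly. The states, training sequence, and cause counts are invented for illustration, and the Bayesian step is reduced to a frequency lookup.

```python
from collections import Counter, defaultdict

# Invented sequence of discretized buffer-utilization levels.
sequence = ["low", "low", "med", "high", "high", "med", "high", "high"]

# Markov model: transition counts between successive levels.
transitions = defaultdict(Counter)
for a, b in zip(sequence, sequence[1:]):
    transitions[a][b] += 1

def predict_next(state):
    # Most likely next level under the learned transition counts.
    return transitions[state].most_common(1)[0][0]

# Cause inference as a co-occurrence lookup (a stand-in for the paper's
# Bayesian classification of anomaly causes).
cause_counts = {"high": Counter({"full stream buffer": 3, "resource hot spot": 1})}

next_level = predict_next("high")
cause = cause_counts[next_level].most_common(1)[0][0]
```

The prediction step is what turns detection into advance alerting: the alert fires on the forecast state, before the anomaly itself is observed.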