Results 1 - 10 of 50
Paragon: QoS-aware scheduling for heterogeneous datacenters
- In Proceedings of the eighteenth international ..., 2013
"... Large-scale datacenters (DCs) host tens of thousands of diverse applications each day. However, interference between colocated workloads and the difficulty to match applications to one of the many hardware platforms available can degrade performance, violating the quality of service (QoS) guarantees ..."
Abstract
-
Cited by 32 (7 self)
- Add to MetaCart
Large-scale datacenters (DCs) host tens of thousands of diverse applications each day. However, interference between colocated workloads and the difficulty of matching applications to one of the many hardware platforms available can degrade performance, violating the quality of service (QoS) guarantees that many cloud workloads require. While previous work has identified the impact of heterogeneity and interference, existing solutions are computationally intensive, cannot be applied online, and do not scale beyond a few applications. We present Paragon, an online and scalable DC scheduler that is heterogeneity- and interference-aware. Paragon is derived from robust analytical methods; instead of profiling each application in detail, it leverages information the system already has about applications it has previously seen. It uses collaborative filtering techniques to quickly and accurately classify an unknown, incoming workload with respect to heterogeneity and interference in multiple shared resources, by identifying similarities to previously scheduled applications. The classification allows Paragon to greedily schedule applications in a manner that minimizes interference and maximizes server utilization. Paragon scales to tens of thousands of servers with marginal scheduling overheads in terms of time or state. We evaluate Paragon with a wide range of workload scenarios, on both small- and large-scale systems, including 1,000 servers on EC2. For a 2,500-workload scenario, Paragon enforces performance guarantees for 91% of applications, while significantly improving utilization. In comparison, heterogeneity-oblivious, interference-oblivious, and least-loaded schedulers only provide similar guarantees for 14%, 11%, and 3% of workloads. The differences are more striking in oversubscribed scenarios where resource efficiency is more critical.
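The collaborative-filtering classification described above can be pictured with a small sketch. The Python below is not Paragon's implementation: it only illustrates the general idea of estimating a new workload's score on every server configuration from a couple of probe measurements, using a truncated SVD of historical scores. All matrices, dimensions, and values are invented for illustration.

"""Toy sketch of collaborative-filtering workload classification
(in the spirit of Paragon; not the paper's actual implementation).
Historical scores and dimensions are made up for illustration."""
import numpy as np

rng = np.random.default_rng(0)

# Historical data: rows = previously seen workloads, cols = server configs,
# entries = relative performance score of that workload on that config.
history = rng.uniform(0.2, 1.0, size=(40, 6))

# Low-rank structure learned offline from the history (truncated SVD).
k = 3
_, _, Vt = np.linalg.svd(history, full_matrices=False)
V = Vt[:k].T                      # (configs x k) latent config factors

# A new workload is briefly profiled on only two configs.
observed_cols = [0, 3]
true_scores = rng.uniform(0.2, 1.0, size=6)   # unknown ground truth
partial = true_scores[observed_cols]

# Fit the workload's latent vector from the observed entries only,
# then reconstruct its expected score on every config.
w, *_ = np.linalg.lstsq(V[observed_cols], partial, rcond=None)
estimated = V @ w

best = int(np.argmax(estimated))
print("estimated scores per config:", np.round(estimated, 2))
print("recommended config:", best)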
CPI²: CPU performance isolation for shared compute clusters
- In EuroSys, 2013
"... Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs’ behavior. O ..."
Abstract
-
Cited by 27 (1 self)
- Add to MetaCart
(Show Context)
Performance isolation is a key challenge in cloud computing. Unfortunately, Linux has few defenses against performance interference in shared resources such as processor caches and memory buses, so applications in a cloud can experience unpredictable performance caused by other programs’ behavior. Our solution, CPI², uses cycles-per-instruction (CPI) data obtained by hardware performance counters to identify problems, select the likely perpetrators, and then optionally throttle them so that the victims can return to their expected behavior. It automatically learns normal and anomalous behaviors by aggregating data from multiple tasks in the same job. We have rolled out CPI² to all of Google’s shared compute clusters. The paper presents the analysis that led us to that outcome, including both case studies and a large-scale evaluation of its ability to solve real production issues.
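A minimal sketch of the underlying idea, aggregating CPI samples across tasks of the same job and flagging outliers, might look like the Python below. The sample values, the 2-sigma threshold, and the throttle placeholder are illustrative assumptions, not Google's production logic.

"""Toy sketch of CPI-based outlier detection (in the spirit of CPI^2).
Sample values, thresholds, and the 'throttle' action are illustrative only."""
import statistics

# Cycles-per-instruction samples for tasks belonging to the same job.
cpi_samples = {
    "task-0": [1.1, 1.2, 1.1, 1.3],
    "task-1": [1.0, 1.1, 1.2, 1.1],
    "task-2": [2.6, 2.8, 2.7, 2.9],   # suffering task (likely victim of interference)
    "task-3": [1.2, 1.1, 1.0, 1.2],
}

# Learn "normal" behaviour by aggregating samples across the whole job.
all_samples = [v for samples in cpi_samples.values() for v in samples]
mean = statistics.mean(all_samples)
stdev = statistics.pstdev(all_samples)
threshold = mean + 2 * stdev           # simple 2-sigma rule, a placeholder

def investigate(task: str) -> None:
    # Placeholder: a real system would identify and throttle a co-located
    # antagonist so the victim's CPI returns to normal.
    print(f"would investigate interference around {task} (CPI above {threshold:.2f})")

for task, samples in cpi_samples.items():
    if statistics.mean(samples) > threshold:
        investigate(task)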
Dynamic energy-aware capacity provisioning for cloud computing environments
- In Proc. IEEE/ACM Int. Conf. Autonomic Computing (ICAC), 2012
"... Data centers have recently gained significant popularity as a cost-effective platform for hosting large-scale service appli-cations. While large data centers enjoy economies of scale by amortizing initial capital investment over large number of machines, they also incur tremendous energy cost in ter ..."
Abstract
-
Cited by 19 (2 self)
- Add to MetaCart
(Show Context)
Data centers have recently gained significant popularity as a cost-effective platform for hosting large-scale service applications. While large data centers enjoy economies of scale by amortizing initial capital investment over a large number of machines, they also incur tremendous energy cost in terms of power distribution and cooling. An effective approach for saving energy in data centers is to dynamically adjust the data center capacity by turning off unused machines. However, this dynamic capacity provisioning problem is known to be challenging, as it requires a careful understanding of the resource demand characteristics as well as consideration of various cost factors, including task scheduling delay, machine reconfiguration cost, and electricity price fluctuation. In this paper, we provide a control-theoretic solution to the dynamic capacity provisioning problem that minimizes the total energy cost while meeting the performance objective in terms of task scheduling delay. Specifically, we model this problem as a constrained discrete-time optimal control problem, and use Model Predictive Control (MPC) to find the optimal control policy. Through extensive analysis and simulation using real workload traces from Google’s compute clusters, we show that our proposed framework can achieve significant reductions in energy cost, while maintaining an acceptable average scheduling delay for individual tasks.
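To make the control-theoretic formulation concrete, the toy Python below does a brute-force receding-horizon search over the number of active machines, trading energy against a crude backlog-based delay proxy and a switching cost. The forecast, cost weights, and delay proxy are invented and are not the paper's model; a real MPC controller would solve a constrained optimization rather than enumerate plans.

"""Toy receding-horizon (MPC-style) capacity controller.
The demand forecast, cost weights, and delay proxy are illustrative only."""
from itertools import product

forecast = [120, 150, 180, 160]     # predicted task arrivals per interval
capacity_per_machine = 20           # tasks one machine can absorb per interval
energy_cost = 1.0                   # cost of keeping one machine on
delay_cost = 5.0                    # penalty per task that must wait
switch_cost = 2.0                   # cost of turning a machine on or off
horizon = len(forecast)
max_machines = 12
current = 6

def plan_cost(plan):
    cost, prev = 0.0, current
    for m, demand in zip(plan, forecast):
        backlog = max(0, demand - m * capacity_per_machine)
        cost += energy_cost * m + delay_cost * backlog + switch_cost * abs(m - prev)
        prev = m
    return cost

# Brute-force search over the horizon (fine for a toy-sized problem; a real
# MPC formulation would solve a constrained optimisation instead).
best_plan = min(product(range(max_machines + 1), repeat=horizon), key=plan_cost)

# Receding horizon: apply only the first decision, then re-plan next interval.
print("machines to keep on now:", best_plan[0], "full plan:", best_plan)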
Quasar: Resource-Efficient and QoS-Aware Cluster Management
"... Cloud computing promises flexibility and high performance for users and high cost-efficiency for operators. Neverthe-less, most cloud facilities operate at very low utilization, hurting both cost effectiveness and future scalability. We present Quasar, a cluster management system that increases reso ..."
Abstract
-
Cited by 16 (4 self)
- Add to MetaCart
(Show Context)
Cloud computing promises flexibility and high performance for users and high cost-efficiency for operators. Nevertheless, most cloud facilities operate at very low utilization, hurting both cost effectiveness and future scalability. We present Quasar, a cluster management system that increases resource utilization while providing consistently high application performance. Quasar employs three techniques. First, it does not rely on resource reservations, which lead to underutilization as users do not necessarily understand workload dynamics and physical resource requirements of complex codebases. Instead, users express performance constraints for each workload, letting Quasar determine the right amount of resources to meet these constraints at any point. Second, Quasar uses classification techniques to quickly and accurately determine the impact of the amount of resources (scale-out and scale-up), type of resources, and interference on performance for each workload and dataset. Third, it uses the classification results to jointly perform resource allocation and assignment, quickly exploring the large space of options for an efficient way to pack workloads on available resources. Quasar monitors workload performance and adjusts resource allocation and assignment when needed. We evaluate Quasar over a wide range of workload scenarios, including combinations of distributed analytics frameworks and low-latency, stateful services, both on a local cluster and a cluster of dedicated EC2 servers. At steady state, Quasar improves resource utilization by 47% in the 200-server EC2 cluster, while meeting performance constraints for workloads of all types.
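A hypothetical sketch of how classification output can drive joint allocation and assignment: pick the smallest allocation predicted to meet each workload's constraint, then greedily place it on the tightest-fitting server. The estimated speedups, constraint, and server sizes below are placeholders, and the heuristic is deliberately simpler than Quasar's.

"""Toy sketch of classification-driven greedy placement (in the spirit of
Quasar). Estimated speedups, constraints, and server sizes are invented."""

# Output of a hypothetical classification step: estimated performance
# (normalised) for each workload at each number of cores.
estimated_perf = {
    "analytics-job": {2: 0.5, 4: 0.8, 8: 0.95},
    "web-service":   {2: 0.7, 4: 0.9, 8: 0.97},
}
performance_target = 0.85            # per-workload constraint
servers = {"s1": 8, "s2": 8}         # free cores per server

placement = {}
for workload, perf_by_cores in estimated_perf.items():
    # Smallest allocation that is predicted to satisfy the target.
    needed = min((c for c, p in sorted(perf_by_cores.items()) if p >= performance_target),
                 default=max(perf_by_cores))
    # Greedy assignment: pick the server with the least free space that still
    # fits, keeping the remaining servers as empty as possible.
    candidates = [s for s, free in servers.items() if free >= needed]
    target = min(candidates, key=lambda s: servers[s]) if candidates else None
    if target:
        servers[target] -= needed
        placement[workload] = (target, needed)

print(placement, "remaining cores:", servers)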
PREPARE: Predictive Performance Anomaly Prevention for Virtualized Cloud Systems
"... Abstract—Virtualized cloud systems are prone to performance anomalies due to various reasons such as resource contentions, software bugs, and hardware failures. In this paper, we present a novel PREdictive Performance Anomaly pREvention (PREPARE) system that provides automatic performance anomaly pr ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
(Show Context)
Virtualized cloud systems are prone to performance anomalies due to various reasons such as resource contention, software bugs, and hardware failures. In this paper, we present a novel PREdictive Performance Anomaly pREvention (PREPARE) system that provides automatic performance anomaly prevention for virtualized cloud computing infrastructures. PREPARE integrates online anomaly prediction, learning-based cause inference, and predictive prevention actuation to minimize the performance anomaly penalty without human intervention. We have implemented PREPARE on top of the Xen platform and tested it on the NCSU Virtual Computing Lab using a commercial data stream processing system (IBM System S) and an online auction benchmark (RUBiS). The experimental results show that PREPARE can effectively prevent performance anomalies while imposing low overhead on the cloud infrastructure.
Index Terms: performance anomaly prevention, online anomaly prediction, cloud computing
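The predict-then-prevent loop can be illustrated with a deliberately tiny sketch: extrapolate a metric a few steps ahead, check the predicted state against an SLO, and trigger a preventive action. The linear extrapolation, the threshold, and the latency samples are assumptions for illustration only; PREPARE's actual predictor and cause inference are considerably more sophisticated.

"""Toy sketch of a predict-then-prevent loop (in the spirit of PREPARE).
The linear extrapolation, threshold, and action are placeholders."""

def predict_next(values, steps_ahead=3):
    # Crude linear extrapolation from the last two samples.
    slope = values[-1] - values[-2]
    return values[-1] + slope * steps_ahead

def anomaly_predicted(predicted_latency_ms, slo_ms=200):
    return predicted_latency_ms > slo_ms

def prevent(vm):
    # Placeholder prevention actuation, e.g. scale up the VM or migrate it.
    print(f"preventive action on {vm}: predicted SLO violation")

latency_history = {"vm-7": [120, 130, 150, 175]}   # made-up samples (ms)

for vm, history in latency_history.items():
    if anomaly_predicted(predict_next(history)):
        prevent(vm)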
AGILE: elastic distributed resource scaling for Infrastructure-as-a-Service
"... Dynamically adjusting the number of virtual machines (VMs) assigned to a cloud application to keep up with load changes and interference from other uses typically requires detailed application knowledge and an ability to know the future, neither of which are readily available to infrastructure servi ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
(Show Context)
Dynamically adjusting the number of virtual machines (VMs) assigned to a cloud application to keep up with load changes and interference from other users typically requires detailed application knowledge and an ability to know the future, neither of which is readily available to infrastructure service providers or application owners. The result is that systems either need to be over-provisioned (costly) or risk missing their performance Service Level Objectives (SLOs) and having to pay penalties (also costly). AGILE deals with both issues: it uses wavelets to provide a medium-term resource demand prediction with enough lead time to start up new application server instances before performance falls short, and it uses dynamic VM cloning to reduce application startup times. Tests using RUBiS and Google cluster traces show that AGILE can predict varying resource demands over the medium term with up to a 3.42× better true-positive rate and 0.34× the false-positive rate of existing schemes. Given a target SLO violation rate, AGILE can efficiently handle dynamic application workloads, reducing both penalties and user dissatisfaction.
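The wavelet idea, predicting demand at a coarse time scale so there is lead time to start new instances, can be sketched with simple Haar averaging, as in the Python below. The demand trace, the number of levels, and the linear extrapolation of the coarse trend are illustrative assumptions, not AGILE's predictor.

"""Toy sketch of wavelet-style multi-scale demand prediction (in the spirit
of AGILE). The Haar averaging, extrapolation, and trace are illustrative."""
import numpy as np

demand = np.array([50, 55, 60, 58, 70, 75, 80, 78,
                   90, 95, 100, 98, 110, 115, 120, 118], dtype=float)

def haar_approximations(signal, levels=3):
    """Return successively smoother versions of the signal (Haar averaging)."""
    approx = [signal]
    for _ in range(levels):
        s = approx[-1]
        approx.append((s[0::2] + s[1::2]) / 2.0)   # pairwise means
    return approx

coarse = haar_approximations(demand)[-1]   # smoothest trend (two samples after three halvings)

# Extrapolate the coarse trend linearly one coarse step ahead; each coarse
# sample covers 2**3 = 8 fine-grained intervals, which is what provides lead
# time to start new instances before the demand actually arrives.
trend = coarse[-1] - coarse[-2]
predicted_coarse = coarse[-1] + trend
print("coarse trend:", coarse, "-> predicted next coarse-window demand:",
      round(float(predicted_coarse), 1))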
Brownout: building more robust cloud applications
"... Self-adaptation is a first class concern for cloud applications, which should be able to withstand diverse runtime changes. Variations are simultaneously happening both at the cloud infrastructure level — for example hardware failures — and at the user workload level — flash crowds. However, robustl ..."
Abstract
-
Cited by 10 (7 self)
- Add to MetaCart
(Show Context)
Self-adaptation is a first-class concern for cloud applications, which should be able to withstand diverse runtime changes. Variations happen simultaneously at the cloud infrastructure level (for example, hardware failures) and at the user workload level (flash crowds). However, robustly withstanding extreme variability requires costly hardware over-provisioning. In this paper, we introduce a self-adaptation programming paradigm called brownout. Using this paradigm, applications can be designed to robustly withstand unpredictable runtime variations, without over-provisioning. The paradigm is based on optional code that can be dynamically deactivated through decisions based on control theory. We modified two popular web application prototypes, RUBiS and RUBBoS, with less than 170 lines of code to make them brownout-compliant. Experiments show that brownout self-adaptation dramatically improves the ability to withstand flash crowds and hardware failures.
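A minimal sketch of a brownout-style feedback loop: a "dimmer" value controls how often the optional code runs, and a proportional controller adjusts it from measured response times. The simulated service model, the gain, and the setpoint below are invented; the paper's controller is derived more carefully.

"""Toy sketch of a brownout-style 'dimmer' feedback loop. The simulated
service model and the controller gain are placeholders, not the paper's."""
import random

setpoint_ms = 100.0     # target response time
dimmer = 1.0            # probability of executing optional code (1.0 = always)
gain = 0.002            # simple proportional gain (illustrative value)

def measured_response_time(load, dimmer):
    # Pretend service: optional content and load both add latency.
    return 40 + 80 * dimmer * load + random.uniform(-5, 5)

random.seed(1)
for step, load in enumerate([0.5, 0.8, 1.2, 1.5, 1.2, 0.7]):
    rt = measured_response_time(load, dimmer)
    # Proportional control: reduce the dimmer when we exceed the setpoint,
    # raise it (serve more optional content) when we have slack.
    dimmer = min(1.0, max(0.0, dimmer + gain * (setpoint_ms - rt)))
    print(f"step {step}: load={load:.1f} rt={rt:5.1f}ms dimmer={dimmer:.2f}")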
Transactional Auto Scaler: Elastic Scaling of In-Memory Transactional Data Grids
"... In this paper we introduce TAS (Transactional Auto Scaler), a system for automating elastic-scaling of in-memory transactional data grids, such as NoSQL data stores or Distributed Transactional Memories. Applications of TAS range from on-line self-optimization of in-production applications to automa ..."
Abstract
-
Cited by 7 (6 self)
- Add to MetaCart
(Show Context)
In this paper we introduce TAS (Transactional Auto Scaler), a system for automating elastic scaling of in-memory transactional data grids, such as NoSQL data stores or Distributed Transactional Memories. Applications of TAS range from online self-optimization of in-production applications to automatic generation of QoS/cost-driven elastic scaling policies, and support for what-if analysis on the scalability of transactional applications. The key innovation at the core of TAS is a novel performance forecasting methodology that relies on the joint usage of analytical modeling and machine learning. By exploiting these two classically competing methodologies in a synergistic fashion, TAS achieves the best of both worlds, namely high extrapolation power and good accuracy, even when faced with complex workloads deployed over public cloud infrastructures. We demonstrate the accuracy and feasibility of TAS via an extensive experimental study based on a fully fledged prototype implementation, integrated with a popular open-source transactional in-memory data store (Red Hat’s Infinispan), and industry-standard benchmarks generating a breadth of heterogeneous workloads.
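A toy illustration of the grey-box idea, pairing an analytical baseline with a learned correction on its residuals, is sketched below. The M/M/1-style formula, the synthetic measurements, and the linear residual fit are assumptions for illustration; TAS's analytical and machine-learning models are far richer.

"""Toy sketch of grey-box forecasting: an analytical queueing baseline plus a
learned correction on its residual (in the spirit of TAS, not its actual
models). All numbers are synthetic."""
import numpy as np

service_time = 0.02                     # seconds per request (assumed known)

def analytical_rt(throughput):
    """M/M/1-style response time estimate."""
    utilisation = np.clip(throughput * service_time, 0, 0.95)
    return service_time / (1 - utilisation)

# Synthetic "measurements" that deviate from the pure analytical model.
throughputs = np.array([5, 10, 20, 30, 40], dtype=float)
measured_rt = analytical_rt(throughputs) * 1.2 + 0.003   # unmodelled overheads

# Machine-learning half: fit a simple linear model to the residual error.
residuals = measured_rt - analytical_rt(throughputs)
coeffs = np.polyfit(throughputs, residuals, deg=1)

def hybrid_rt(throughput):
    return analytical_rt(throughput) + np.polyval(coeffs, throughput)

print("pure analytical at 35 req/s:", round(float(analytical_rt(35)), 4), "s")
print("hybrid prediction at 35 req/s:", round(float(hybrid_rt(35)), 4), "s")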
Adaptive, Model-driven Autoscaling for Cloud Applications
"... Applications with a dynamic workload demand need access to a flexible infrastructure to meet performance guarantees and minimize resource costs. While cloud computing provides the elasticity to scale the infrastruc-ture on demand, cloud service providers lack control and visibility of user space app ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Applications with a dynamic workload demand need access to a flexible infrastructure to meet performance guarantees and minimize resource costs. While cloud computing provides the elasticity to scale the infrastructure on demand, cloud service providers lack control and visibility of user-space applications, making it difficult to accurately scale the underlying infrastructure. Thus, the burden of scaling falls on the user. In this paper, we propose a new cloud service, Dependable Compute Cloud (DC2), that automatically scales the infrastructure to meet the user-specified performance requirements. DC2 employs Kalman filtering to automatically learn the (possibly changing) system parameters for each application, allowing it to proactively scale the infrastructure to meet performance guarantees. DC2 is designed for the cloud: it is application-agnostic and does not require any offline application profiling or benchmarking. Our implementation results on OpenStack, using a multi-tier application under a range of workload traces, demonstrate the robustness and superiority of DC2 over existing rule-based approaches.
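Kalman filtering for parameter learning can be shown with a scalar example: track a slowly drifting per-request service demand from noisy observations under a random-walk model. The noise variances and the synthetic trace below are invented, not DC2's actual state-space model.

"""Toy scalar Kalman filter tracking a drifting per-request service demand
(in the spirit of DC2's Kalman-filter parameter learning; values invented)."""
import random

random.seed(7)

# True (hidden) service demand drifts over time; we only see noisy estimates.
true_demand, observations = 0.050, []
for _ in range(30):
    true_demand += random.gauss(0, 0.001)                        # slow drift
    observations.append(true_demand + random.gauss(0, 0.005))    # noisy measurement

# Kalman filter for a random-walk state: x_k = x_{k-1} + w,  z_k = x_k + v.
x, p = 0.03, 1.0             # initial estimate and its variance
q, r = 1e-6, 2.5e-5          # assumed process and measurement noise variances
for z in observations:
    p += q                   # predict
    k = p / (p + r)          # Kalman gain
    x += k * (z - x)         # update with the new measurement
    p *= (1 - k)

print(f"estimated service demand: {x:.4f}s (true value near {true_demand:.4f}s)")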
Anchor: A Versatile and Efficient Framework for Resource Management in the Cloud
"... Abstract—We present Anchor, a general resource management architecture that uses the stable matching framework to decouple policies from mechanisms when mapping virtual machines to physical servers. In Anchor, clients and operators are able to express a variety of distinct resource management polici ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
We present Anchor, a general resource management architecture that uses the stable matching framework to decouple policies from mechanisms when mapping virtual machines to physical servers. In Anchor, clients and operators are able to express a variety of distinct resource management policies as they deem fit, and these policies are captured as preferences in the stable matching framework. The highlight of Anchor is a new many-to-one stable matching theory that efficiently matches VMs with heterogeneous resource needs to servers, using both offline and online algorithms. Our theoretical analyses show the convergence and optimality of the algorithm. Our experiments with a prototype implementation on a 20-node server cluster, as well as large-scale simulations based on real-world workload traces, demonstrate that the architecture is able to realize a diverse set of policy objectives with good performance and practicality.
Index Terms: Cloud computing, resource management, stable matching, VM placement
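Many-to-one stable matching can be sketched with the classic VM-proposing deferred-acceptance procedure below; preferences and capacities are invented, and the sketch ignores the heterogeneous VM sizes that Anchor's own algorithm handles.

"""Toy many-to-one stable matching of VMs to servers via VM-proposing
deferred acceptance. Preferences and capacities are invented; Anchor's own
algorithm additionally handles heterogeneous VM sizes."""

vm_prefs = {            # each VM ranks servers, best first
    "vm1": ["s1", "s2"],
    "vm2": ["s1", "s2"],
    "vm3": ["s2", "s1"],
}
server_prefs = {        # each server ranks VMs, best first
    "s1": ["vm2", "vm1", "vm3"],
    "s2": ["vm1", "vm3", "vm2"],
}
capacity = {"s1": 1, "s2": 2}

rank = {s: {vm: i for i, vm in enumerate(prefs)} for s, prefs in server_prefs.items()}
assigned = {s: [] for s in server_prefs}
free = list(vm_prefs)
next_choice = {vm: 0 for vm in vm_prefs}

while free:
    vm = free.pop(0)
    if next_choice[vm] >= len(vm_prefs[vm]):
        continue                                  # no server left to propose to
    server = vm_prefs[vm][next_choice[vm]]
    next_choice[vm] += 1
    assigned[server].append(vm)
    if len(assigned[server]) > capacity[server]:
        # Server over capacity: reject its least-preferred current proposer.
        worst = max(assigned[server], key=lambda v: rank[server][v])
        assigned[server].remove(worst)
        free.append(worst)

print(assigned)   # a stable assignment, e.g. {'s1': ['vm2'], 's2': ['vm3', 'vm1']}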