Results 1 - 10 of 24
The impact of memory subsystem resource sharing on datacenter applications
- In 38th Int’l Symp. on Computer Architecture (ISCA), 2011
"... ABSTRACT In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC ..."
Abstract
-
Cited by 54 (8 self)
- Add to MetaCart
(Show Context)
Abstract: In this paper we study the impact of sharing memory resources on five Google datacenter applications: a web search engine, bigtable, content analyzer, image stitching, and protocol buffer. While prior work has found neither positive nor negative effects from cache sharing across the PARSEC benchmark suite, we find that across these datacenter applications there is both a sizable benefit and a potential degradation from improperly sharing resources. We first present a study of the importance of thread-to-core mappings for applications in the datacenter, since threads can be mapped to share or not share caches and bus bandwidth. Second, we investigate the impact of co-locating threads from multiple applications with diverse memory behavior and discover that the best mapping for a given application changes depending on its co-runner. Third, we investigate the application characteristics that impact performance in the various thread-to-core mapping scenarios. Finally, we present both a heuristics-based and an adaptive approach to arrive at good thread-to-core decisions in the datacenter. We observe performance swings of up to 25% for web search and 40% for other key applications, simply based on how application threads are mapped to cores. By employing our adaptive thread-to-core mapper, the performance of the datacenter applications presented in this work improved by up to 22% over the status-quo thread-to-core mapping and came within 3% of optimal.
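The thread-to-core mapping the abstract studies can be steered from user space. Below is a minimal sketch, assuming Linux and a made-up topology in which cores 0 and 1 share a cache while core 2 sits under a different one; a real mapper would read the topology from /sys/devices/system/cpu/cpuN/cache rather than hard-coding it.

```python
import os
import threading

# Assumed topology for illustration only: cores 0 and 1 share a cache,
# core 2 does not.
SHARE_CACHE = {0, 1}       # co-locate threads under one cache
SPLIT_CACHE = {0, 2}       # spread threads across caches

def worker(allowed_cores):
    # On Linux, pid 0 means "the calling thread" for sched_setaffinity,
    # so each thread restricts itself to the chosen cores.
    os.sched_setaffinity(0, allowed_cores)
    # ... the application thread's actual work would run here ...

# Sharing a cache tends to help threads that communicate or share data;
# splitting tends to help threads with large, disjoint working sets.
threads = [threading.Thread(target=worker, args=(SHARE_CACHE,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```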
Resource-Freeing Attacks: Improve Your Cloud Performance (at Your Neighbor’s Expense)
- In ACM CCS, 2012
"... Cloud computing promises great efficiencies by multiplexing resources among disparate customers. For example, Amazon’s Elastic Compute Cloud (EC2), Microsoft Azure, Google’s Compute Engine, and Rackspace Hosting all offer Infrastructure as a Service (IaaS) solutions that pack multiple customer virtu ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Cloud computing promises great efficiencies by multiplexing resources among disparate customers. For example, Amazon’s Elastic Compute Cloud (EC2), Microsoft Azure, Google’s Compute Engine, and Rackspace Hosting all offer Infrastructure as a Service (IaaS) solutions that pack multiple customer virtual machines (VMs) onto the same physical server. The gained efficiencies have a cost: past work has shown that the performance of one customer’s VM can suffer due to interference from another. In experiments on a local testbed, we found that the performance of a cache-sensitive benchmark can degrade by more than 80% because of interference from another VM. This interference incentivizes a new class of attacks that we call resource-freeing attacks (RFAs). The goal is to modify the workload of a victim VM in a way that frees up resources for the attacker’s VM. We explore in depth a particular example of an RFA. Counter-intuitively, by adding load to a co-resident victim, the attack speeds up a class of cache-bound workloads. In a controlled lab setting we show that this can improve the performance of synthetic benchmarks by up to 60% over not running the attack. In the noisier setting of Amazon’s EC2, we still show improvements of up to 13%.
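The 80% figure comes from a cache-sensitive benchmark run against a noisy co-runner. A toy probe in the same spirit fits in a few lines; this is only a sketch, with the working-set size an assumption about the last-level cache, not the paper's actual benchmark.

```python
import time
import numpy as np

# Walking an array comparable in size to the LLC makes throughput
# sensitive to a co-runner that evicts its lines.
WORKING_SET_BYTES = 8 * 1024 * 1024   # assume an ~8 MiB LLC
data = np.ones(WORKING_SET_BYTES // 8)

def probe(iterations=50):
    start = time.perf_counter()
    for _ in range(iterations):
        data.sum()   # streams the working set through the cache
    return iterations / (time.perf_counter() - start)

# Run once alone, then again while a co-resident VM or process is
# active; a large drop in throughput indicates cache contention.
print(f"probe throughput: {probe():.1f} iters/s")
```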
Measuring Interference Between Live Datacenter Applications
"... Abstract—Application interference is prevalent in datacenters due to contention over shared hardware resources. Unfortunately, understanding interference in live datacenters is more difficult than in controlled environments or on simpler architectures. Most approaches to mitigating interference rely ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Application interference is prevalent in datacenters due to contention over shared hardware resources. Unfortunately, understanding interference in live datacenters is more difficult than in controlled environments or on simpler architectures. Most approaches to mitigating interference rely on data that cannot be collected efficiently in a production environment. This work exposes eight specific complexities of live datacenters that constrain measurement of interference. It then introduces new, generic measurement techniques for analyzing interference in the face of these challenges and restrictions. We use the measurement techniques to conduct the first large-scale study of application interference in live production datacenter workloads. Data is measured across 1,000 12-core Google servers observed to be running 1,102 unique applications. Finally, our work identifies several opportunities to improve performance that use only the available data; these opportunities are applicable to any datacenter.
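One generic way to analyze interference from passively collected data is to compare an application's performance samples across the co-runner sets it was observed with. The sketch below is not the paper's method; the sample format, the IPC metric, and the use of medians are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import median

# Each sample records an application's performance metric and which
# other applications shared the server at the time (format assumed).
samples = [
    ("websearch", 1.8, frozenset()),             # (app, ipc, co-runners)
    ("websearch", 1.2, frozenset({"bigtable"})),
    # ... thousands more, collected from live servers ...
]

by_corunner = defaultdict(list)
for app, ipc, corunners in samples:
    by_corunner[(app, corunners)].append(ipc)

def slowdown(app, corunners):
    # Compare the median metric with these co-runners against the
    # median when the app ran without co-runners.
    alone = median(by_corunner[(app, frozenset())])
    together = median(by_corunner[(app, corunners)])
    return 1 - together / alone

print(f"{slowdown('websearch', frozenset({'bigtable'})):.0%} slower")
```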
Brownout: building more robust cloud applications
"... Self-adaptation is a first class concern for cloud applications, which should be able to withstand diverse runtime changes. Variations are simultaneously happening both at the cloud infrastructure level — for example hardware failures — and at the user workload level — flash crowds. However, robustl ..."
Abstract
-
Cited by 12 (7 self)
- Add to MetaCart
(Show Context)
Abstract: Self-adaptation is a first-class concern for cloud applications, which should be able to withstand diverse runtime changes. Variations happen simultaneously at the cloud infrastructure level (for example, hardware failures) and at the user workload level (flash crowds). However, robustly withstanding extreme variability requires costly hardware over-provisioning. In this paper, we introduce a self-adaptation programming paradigm called brownout. Using this paradigm, applications can be designed to robustly withstand unpredictable runtime variations without over-provisioning. The paradigm is based on optional code that can be dynamically deactivated through decisions based on control theory. We modified two popular web application prototypes, RUBiS and RUBBoS, with less than 170 lines of code to make them brownout-compliant. Experiments show that brownout self-adaptation dramatically improves the ability to withstand flash crowds and hardware failures.
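The core mechanism, optional code guarded by a "dimmer" that a controller adjusts, can be sketched compactly. The proportional controller, gain, and setpoint below are stand-ins; the paper derives its controller with control theory rather than hand-tuning.

```python
import random

SETPOINT_S = 0.5   # target response time in seconds (assumed value)
GAIN = 0.5         # controller gain (assumed; the paper derives its own)
dimmer = 1.0       # probability of executing the optional code path

def serve_mandatory_content():
    pass           # stub: e.g., render the product page itself

def serve_optional_content():
    pass           # stub: e.g., render recommendations

def handle_request():
    serve_mandatory_content()
    if random.random() < dimmer:   # optional code, dynamically disabled
        serve_optional_content()

def control_step(measured_response_time_s):
    """Periodically nudge the dimmer toward the latency setpoint."""
    global dimmer
    error = SETPOINT_S - measured_response_time_s
    dimmer = max(0.0, min(1.0, dimmer + GAIN * error))
```

Under a flash crowd the measured latency rises above the setpoint, the dimmer falls, and capacity is reclaimed from optional content instead of being over-provisioned.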
No More Backstabbing... A Faithful Scheduling Policy for Multithreaded Programs
- 2011
"... Abstract—Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, contention management policies provided by modern operating systems increase context-switches and lead to performance degradation for multithrea ..."
Abstract
-
Cited by 9 (4 self)
- Add to MetaCart
Abstract: Efficient contention management is the key to achieving scalable performance for multithreaded applications running on multicore systems. However, the contention management policies provided by modern operating systems increase context switches and lead to performance degradation for multithreaded applications under high loads. Moreover, this problem is exacerbated by the interaction between contention management policies and OS scheduling policies. Time Share (TS) is the default scheduling policy in modern OSes such as OpenSolaris, and under TS the priorities of threads change very frequently to balance load and provide fairness in scheduling. Due to this frequent ping-ponging of priorities, threads of an application are often preempted by threads of the same application. This increases the frequency of involuntary context switches as well as lock-holder thread preemptions and leads to poor performance, and the problem becomes especially serious under high loads. To alleviate it, we present a scheduling policy called Faithful Scheduling (FF), which dramatically reduces context switches as well as lock-holder thread preemptions. We implemented FF on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06 and evaluated it using 22 programs, including the TATP database application, SPECjbb2005, programs from PARSEC and SPEC OMP, and some microbenchmarks. The experimental results show that the FF policy achieves high performance for both lightly and heavily loaded systems. Moreover, it does not require any changes to application source code or the OS kernel. Keywords: scheduling, priorities, contention, context switches.
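On Linux, the closest user-space analogue of FF's stable priorities is to move an application's threads onto a fixed-priority policy so the time-share scheduler cannot churn them. This is only a sketch of the idea: the paper implements FF inside OpenSolaris, and the policy and priority value below are assumptions.

```python
import os

FIXED_PRIORITY = 10   # assumed value; what matters is that every
                      # thread of the application shares one priority

def fix_priority_of_calling_thread():
    # SCHED_FIFO threads keep a constant priority, so threads of the
    # same application stop preempting each other as time-share
    # priorities ping-pong. Requires root / CAP_SYS_NICE.
    os.sched_setscheduler(0, os.SCHED_FIFO,
                          os.sched_param(FIXED_PRIORITY))
```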
ADAPT: A Framework for Coscheduling Multithreaded Programs
"... Since multicore systems offer greater performance via parallelism, future computing is progressing towards use of multicore machines with large number of cores. However, the performance of emerging multithreaded programs often does not scale to fully utilize the available cores. Therefore, simultane ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Abstract: Since multicore systems offer greater performance via parallelism, future computing is progressing towards machines with large numbers of cores. However, the performance of emerging multithreaded programs often does not scale to fully utilize the available cores. Simultaneously running multiple multithreaded applications therefore becomes inevitable to fully exploit such machines. However, multicore machines pose a challenge for the OS with respect to maximizing performance and throughput in the presence of multiple multithreaded programs. We have observed that state-of-the-art contention management algorithms fail to effectively coschedule multithreaded programs on multicore machines. To address this challenge, we present ADAPT, a scheduling framework that continuously monitors the resource usage of multithreaded programs and adaptively coschedules them such that they interfere with each other’s performance as little as possible. In addition, it adaptively selects appropriate memory allocation and scheduling policies according to the workload characteristics. We have implemented ADAPT on a 64-core Supermicro server running Solaris 11 and evaluated it using 26 multithreaded programs, including the TATP database application, SPECjbb2005, and programs from Phoenix, PARSEC, and SPEC OMP. The experimental results show that ADAPT substantially improves total turnaround time and system utilization relative to the default Solaris 11 scheduler.
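One building block of such coscheduling is pairing programs whose demands complement each other. The sketch below reduces each program to a single made-up "memory intensity" score; ADAPT itself monitors several resource-usage metrics online, so treat this as an illustration of the pairing step only.

```python
# Assumed profiles: higher means more memory-intensive.
profiles = {
    "tatp":         0.9,
    "swaptions":    0.1,
    "facesim":      0.7,
    "blackscholes": 0.2,
}

def pair_complementary(profiles):
    """Greedily pair the most memory-intensive program with the least,
    so co-runners contend as little as possible for the same resource."""
    ranked = sorted(profiles, key=profiles.get)
    pairs = []
    while len(ranked) >= 2:
        pairs.append((ranked.pop(0), ranked.pop()))  # lightest + heaviest
    return pairs

print(pair_complementary(profiles))
# [('swaptions', 'tatp'), ('blackscholes', 'facesim')]
```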
Chameleon: Operating System Support for Dynamic Processors
"... The rise of multi-core processors has shifted performance efforts towards parallel programs. However, single-threaded code, whether from legacy programs or ones difficult to parallelize, remains important. Proposed asymmetric multicore processors statically dedicate hardware to improve sequential pe ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
Abstract: The rise of multi-core processors has shifted performance efforts towards parallel programs. However, single-threaded code, whether from legacy programs or ones difficult to parallelize, remains important. Proposed asymmetric multicore processors statically dedicate hardware to improve sequential performance, but at the cost of reduced parallel performance. However, several proposed mechanisms provide the best of both worlds by combining multiple cores into a single, more powerful processor for sequential code. For example, Core Fusion merges multiple cores to pool caches and functional units, and Intel’s Turbo Boost raises the clock speed of a core if the other cores on a chip are powered down. These reconfiguration mechanisms have two important properties. First, the set of available cores and their capabilities can vary over short time scales, and current operating systems are not designed for rapidly changing hardware: the existing hotplug mechanisms for reconfiguring processors require global operations and hundreds of milliseconds to complete. Second, configurations may be mutually exclusive: using power to speed up one core means it cannot be used to speed up another, a requirement current schedulers cannot manage. We present Chameleon, an extension to Linux to support dynamic processors that can reconfigure their cores at runtime. Chameleon provides processor proxies to enable rapid reconfiguration, execution objects to abstract the processing capabilities of physical CPUs, and a cluster scheduler to balance the needs of sequential and parallel programs. In experiments that emulate a dynamic processor, we find that Chameleon can reconfigure processors 100,000 times faster than Linux and allows applications full access to hardware capabilities: sequential code runs at full speed on a powerful execution context, while parallel code runs on as many cores as possible.
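The mutual-exclusion property is the part that breaks current schedulers, and it is easy to model in a few lines. The class below is a toy, not Chameleon's interface; the single boost budget is an assumption standing in for power or fused-core constraints.

```python
class DynamicProcessor:
    """Toy model (assumed, not Chameleon's API) of mutually exclusive
    configurations: power spent speeding one core is unavailable to
    speed another, as with Turbo Boost or Core Fusion."""

    def __init__(self, n_cores, boost_budget=1):
        self.n_cores = n_cores
        self.budget = boost_budget
        self.boosted = set()

    def boost(self, core):
        if len(self.boosted) >= self.budget:
            raise RuntimeError("budget spent: un-boost another core first")
        self.boosted.add(core)    # e.g., raise this core's clock

    def unboost(self, core):
        self.boosted.discard(core)

cpu = DynamicProcessor(n_cores=4)
cpu.boost(0)      # sequential phase: one fast core
cpu.unboost(0)    # parallel phase: back to many equal cores
```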
WATS: Workload-Aware Task Scheduling in Asymmetric Multi-core Architectures
- In IPDPS’12, IEEE, 2012
"... Abstract—Asymmetric Multi-Core (AMC) architectures have shown high performance as well as power efficiency. However, current parallel programming environments do not perform well on AMC due to their assumption that all cores are symmetric and provide equal performance. Their random task scheduling p ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Abstract: Asymmetric Multi-Core (AMC) architectures have shown high performance as well as power efficiency. However, current parallel programming environments do not perform well on AMC due to their assumption that all cores are symmetric and provide equal performance. Their random task-scheduling policies, such as task-stealing, can result in unbalanced workloads on AMC and severely degrade the performance of parallel applications. To balance the workloads of parallel applications on AMC, this paper proposes a Workload-Aware Task Scheduling (WATS) scheme that adopts history-based task allocation and preference-based task stealing. The history-based task allocation is based on a near-optimal, static task allocation using historical statistics collected during the execution of a parallel application. The preference-based task stealing, which steals tasks based on a preference list, can dynamically adjust the workloads on AMC if the task allocation is less than optimal due to approximation in the history-based task allocation. Experimental results show that WATS can improve the performance of CPU-bound applications by up to 82.7% compared with random task-scheduling policies.
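Preference-based stealing replaces the random victim choice of classic task-stealing with an ordered list per worker. The sketch below illustrates just that substitution; the preference lists here are made up, where WATS derives them from measured core speeds and task history.

```python
from collections import deque

# Each worker owns a deque of tasks; preference lists (values assumed)
# replace random victim selection.
deques = {w: deque() for w in range(4)}
preference = {
    0: [2, 3, 1],   # worker 0 tries victim 2 first, then 3, then 1
    1: [3, 2, 0],
    2: [0, 1, 3],
    3: [1, 0, 2],
}

def steal(worker):
    for victim in preference[worker]:   # ordered, not random
        if deques[victim]:
            return deques[victim].popleft()
    return None                         # nothing to steal anywhere
```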
Parallel Discrete Event Simulation for Multi-core Systems: Analysis and Optimization
"... Abstract—Parallel Discrete Event Simulation (PDES) can substantially improve the performance and capacity of simulation, allowing the study of larger, more detailed models, in less time. PDES is a fine-grained parallel application whose performance and scalability is limited by communication latenci ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Abstract: Parallel Discrete Event Simulation (PDES) can substantially improve the performance and capacity of simulation, allowing the study of larger, more detailed models in less time. PDES is a fine-grained parallel application whose performance and scalability are limited by communication latencies. Traditionally, PDES simulation kernels use message passing; often these simulators are written for distributed environments, and shared memory is used to optimize message passing among processes on the same machine. In this paper, we develop, characterize, and optimize a thread-based version of a PDES simulator on three representative multi-core platforms. The multi-threaded implementation eliminates multiple message copies and significantly reduces synchronization delays. We study the performance of the simulator on three hardware platforms: an Intel Core i7 machine, a 48-core AMD Opteron Magny-Cours system, and a 64-core Tilera TilePro64. We discover that the three platforms encounter substantially different bottlenecks because of their different architectures. We identify these bottlenecks and propose mechanisms to overcome them. Our results show that the multi-threaded implementation improves performance over an MPI-based version by up to a factor of 3 on the Core i7, 1.4 on the AMD Magny-Cours, and 2.8 on the Tilera TilePro64.
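The copy-elimination argument reduces to a simple shared-memory structure: a sender pushes an event object directly into the destination logical process's queue instead of serializing it into an MPI message. The event format and locking scheme below are assumptions, not the paper's kernel.

```python
import heapq
import threading

class LogicalProcess:
    def __init__(self, lp_id):
        self.lp_id = lp_id
        self.lock = threading.Lock()
        self.events = []            # min-heap ordered by timestamp

    def schedule(self, timestamp, payload):
        with self.lock:             # one lock; no copy, no marshalling
            heapq.heappush(self.events, (timestamp, payload))

    def next_event(self):
        with self.lock:
            return heapq.heappop(self.events) if self.events else None

lps = [LogicalProcess(i) for i in range(8)]
lps[3].schedule(10.5, "arrival")    # zero-copy cross-LP event
```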
Adaptive Power and Resource Management Techniques for Multithreaded Workloads
- In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, Workshops and PhD Forum, 2013
"... Abstract-As today's computing trends are moving towards the cloud, meeting the increasing computational demand while minimizing the energy costs in data centers has become essential. This work introduces two adaptive techniques to reduce the energy consumption of the computing clusters through ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Abstract: As today's computing trends move towards the cloud, meeting the increasing computational demand while minimizing energy costs in data centers has become essential. This work introduces two adaptive techniques to reduce the energy consumption of computing clusters through power and resource management on multi-core processors. We first present a novel power-capping technique to constrain the power consumption of computing nodes. Our technique combines Dynamic Voltage-Frequency Scaling (DVFS) and thread allocation on multi-core systems. By utilizing machine learning techniques, our power-capping method is able to meet power budgets 82% of the time without requiring any power measurement device, and it reduces energy consumption by 51.6% on average in comparison to state-of-the-art techniques. We then introduce an autonomous resource management technique for consolidated multi-threaded workloads running on multi-core servers. Our technique first classifies applications according to an energy-efficiency measure, then proportionally allocates resources to co-scheduled applications to improve energy efficiency. The proposed technique improves energy efficiency by 17% in comparison to state-of-the-art co-scheduling policies.
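The feedback structure of DVFS-based power capping can be sketched briefly. Note the divergence from the paper: their method needs no power measurement device (it uses learned models), while this sketch reads Intel RAPL energy counters, so the sysfs paths are platform assumptions and the budget and step size are made-up values; it also needs root.

```python
import time

BUDGET_WATTS = 60.0                  # assumed power budget
STEP_KHZ = 100_000                   # assumed frequency step
RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"
FREQ = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"

def read_energy_uj():
    with open(RAPL) as f:
        return int(f.read())

def adjust_frequency(delta_khz):
    with open(FREQ) as f:
        cur = int(f.read())
    with open(FREQ, "w") as f:       # lower/raise the frequency ceiling
        f.write(str(cur + delta_khz))

last = read_energy_uj()
while True:
    time.sleep(1.0)
    now = read_energy_uj()
    watts = (now - last) / 1e6       # microjoules per second -> watts
    last = now
    # Simple feedback: back off when over budget, restore headroom
    # when under.
    adjust_frequency(-STEP_KHZ if watts > BUDGET_WATTS else STEP_KHZ)
```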