Results 1 - 10 of 16
Thread Reinforcer: Dynamically determining number of threads via OS level monitoring
- in 2011 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 2011
"... Abstract—It is often assumed that to maximize the performance of a multithreaded application, the number of threads created should equal the number of cores. While this may be true for systems with four or eight cores, this is not true for systems with larger number of cores. Our experiments with PA ..."
Cited by 12 (3 self)
Abstract—It is often assumed that to maximize the performance of a multithreaded application, the number of threads created should equal the number of cores. While this may be true for systems with four or eight cores, it is not true for systems with larger numbers of cores. Our experiments with PARSEC programs on a 24-core machine demonstrate this. Therefore, dynamically determining the appropriate number of threads for a multithreaded application is an important unsolved problem. In this paper we develop a simple technique for dynamically determining the appropriate number of threads without recompiling the application, using complex compilation techniques, or modifying operating system policies. We first present a scalability study of eight programs from PARSEC conducted on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06, for thread counts ranging from a few threads to 128. Our study shows that not only does the maximum speedup achieved by these programs vary widely (from 3.6x to 21.9x), but the number of threads that produces the maximum speedup also varies widely (from 16 to 63 threads). By understanding the overall speedup behavior of these programs we identify the critical operating-system-level factors that explain why the speedups vary with the number of threads. As an application of these observations, we develop a framework called “Thread Reinforcer” that dynamically monitors a program’s execution to search for the number of threads likely to yield the best speedup. Thread Reinforcer identifies an optimal or near-optimal number of threads for most of the PARSEC programs studied, as well as for SPEC OMP and PBZIP2 programs.
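As a rough illustration of the kind of search such a framework performs, here is a minimal sketch in C with OpenMP; it is not the authors' Thread Reinforcer implementation. run_trial() and its placeholder loop are hypothetical stand-ins for a monitored slice of the real application; the search doubles the thread count until the gains flatten.

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical probe: run a fixed slice of work with n threads and
     * return the elapsed time in seconds. */
    static double run_trial(int n)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(n)
        {
            volatile double x = 0.0;            /* placeholder workload */
            for (long i = 0; i < 20000000L; i++)
                x += i * 0.5;
        }
        return omp_get_wtime() - t0;
    }

    int main(void)
    {
        int best_n = 1;
        double best_t = run_trial(1);
        for (int n = 2; n <= 128; n *= 2) {     /* search past the core count */
            double t = run_trial(n);
            if (t < 0.95 * best_t) {            /* demand a 5% improvement */
                best_t = t;
                best_n = n;
            } else {
                break;                          /* speedup curve flattened */
            }
        }
        printf("chosen thread count: %d\n", best_n);
        return 0;
    }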
Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems
"... Abstract — Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data parallel work by taking advantage of its massive number of cores while the CPU handles non data-parallel work, such as the sequential code or data transfer management. Unfortu ..."
Cited by 3 (1 self)
Abstract — Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. Unfortunately, this work distribution can be a poor solution, as it underutilizes the CPU, has difficulty generalizing beyond a single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance-competitive with GPUs on many workloads, so simply partitioning work based on the fixed roles may be a poor choice. In this paper, we present the single kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs exhibit with respect to input size. On real hardware, SKMD achieves an average speedup of 29% on a system with one multicore CPU and two asymmetric GPUs, compared to a fastest-device execution strategy, for a set of popular OpenCL kernels. Index Terms—GPGPU, OpenCL, Collaboration, Data parallel
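The core partitioning idea can be sketched independently of the OpenCL plumbing. The C sketch below is ours, not SKMD's code: a hypothetical launch_on_device() hides buffer transfers and kernel enqueues, and the global index range of one kernel is split across devices in proportion to measured per-device throughput, with a merge of partial outputs left as a final step.

    #include <stddef.h>

    #define NDEV 3    /* assumed: one multicore CPU + two GPUs, as in the paper's setup */

    /* hypothetical wrapper hiding OpenCL buffer transfers and kernel enqueue */
    void launch_on_device(int d, size_t begin, size_t end);

    void partition_kernel(size_t n, const double throughput[NDEV])
    {
        double total = 0.0;
        for (int d = 0; d < NDEV; d++)
            total += throughput[d];

        size_t begin = 0;
        for (int d = 0; d < NDEV; d++) {
            /* last device takes the remainder so the ranges tile [0, n) exactly */
            size_t share = (d == NDEV - 1)
                         ? n - begin
                         : (size_t)((double)n * throughput[d] / total);
            launch_on_device(d, begin, begin + share);
            begin += share;
        }
        /* after all devices finish, their partial output buffers are merged */
    }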
Performance evaluation and analysis of thread pinning strategies on multi-core platforms: Case study of SPEC OMP applications on Intel architectures
- in 2011 International Conference on High Performance Computing and Simulation (HPCS). IEEE, 2011
"... ABSTRACT With the introduction of multi-core processors, thread affinity has quickly appeared to be one of the most important factors to accelerate program execution times. The current article presents a complete experimental study on the performance of various thread pinning strategies. We investi ..."
Cited by 3 (1 self)
ABSTRACT With the introduction of multi-core processors, thread affinity has quickly emerged as one of the most important factors for accelerating program execution times. This article presents a complete experimental study of the performance of various thread pinning strategies. We investigate four application-independent thread pinning strategies and five application-sensitive ones based on cache sharing. We performed extensive performance evaluation on three different multi-core machines reflecting three usual usage scenarios: a workstation machine, a server machine, and a high performance machine. Overall, we show that fixing thread affinities (whatever the tested strategy) is a better choice for improving program performance on HPC ccNUMA machines than OS-based thread placement. This means that the current Linux OS scheduling strategy is not necessarily the best choice in terms of performance on ccNUMA machines, even if it is a good choice in terms of core usage ratio and work balancing. On smaller Core2 and Nehalem machines, we show that the speedups of thread pinning over OS-based scheduling are not satisfactory, but the performance stability is much better.
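For reference, an application-independent "compact" pinning of the kind this family of strategies includes can be written in a few lines with the GNU pthread_setaffinity_np extension. This is a generic sketch, not the paper's evaluation harness, and error handling is omitted.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void pin_self_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        pin_self_to_core((int)(long)arg);   /* compact strategy: thread i -> core i */
        /* ... thread's actual work ... */
        return 0;
    }

    int main(void)
    {
        enum { NTHREADS = 4 };              /* assumed <= number of cores */
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], 0, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], 0);
        return 0;
    }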
When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency
"... While developing shared-memory programs, programmers often contend with the problem of how many threads to create for best efficiency. Creating as many threads as the number of available processor cores, or more, may not be the most efficient configuration. Too many threads can result in excessive c ..."
Cited by 2 (0 self)
While developing shared-memory programs, programmers often contend with the problem of how many threads to create for best efficiency. Creating as many threads as the number of available processor cores, or more, may not be the most efficient configuration. Too many threads can result in excessive contention for shared resources, wasting energy, which is of primary concern for embedded devices. Furthermore, thermal and power constraints prevent us from operating all the processor cores at the highest possible frequency, favoring fewer threads. The best number of threads to run depends on the application, the user input, and the hardware resources available. It can also change at runtime, making it infeasible for the programmer to determine this number. To address this problem, we propose LIMO, a runtime system that dynamically manages the number of running threads of an application to maximize performance and energy efficiency. LIMO monitors threads’ progress along with the usage of shared hardware resources to determine the best number of threads to run and the voltage and frequency level. With dynamic adaptation, LIMO provides an average of 21% performance improvement and a 2x improvement in energy efficiency on a 32-core system over the default configuration of 32 threads, for a set of concurrent applications from the PARSEC suite, the Apache web server, and the Sphinx speech recognition system.
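The "controlled parallelism" half of this idea can be illustrated with a counting semaphore that admits only a subset of the created threads at a time; a controller could resize that subset from progress and contention measurements. This is our minimal sketch, not LIMO itself, and the DVFS side (choosing a voltage/frequency level) is omitted.

    #include <pthread.h>
    #include <semaphore.h>

    static sem_t gate;                      /* counts runnable slots */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int chunk = 0; chunk < 100; chunk++) {
            sem_wait(&gate);                /* run only while holding a slot */
            /* ... process one chunk of work ... */
            sem_post(&gate);                /* release the slot between chunks */
        }
        return 0;
    }

    int main(void)
    {
        enum { NTHREADS = 32 };
        sem_init(&gate, 0, 8);              /* admit 8 of 32 threads; a controller
                                               would raise or lower this by posting
                                               or draining slots at runtime */
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], 0, worker, 0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], 0);
        sem_destroy(&gate);
        return 0;
    }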
Dynamic Thread Pinning for Phase-Based OpenMP Programs
, 2013
"... Abstract. Thread affinity has appeared as an important technique to improve the overall program performance and for better performance stability. However, if we consider a program with multiple phases, it is unlikely that a single thread affinity produces the best program performance for all these p ..."
Cited by 1 (0 self)
Abstract. Thread affinity has emerged as an important technique for improving overall program performance and performance stability. However, if we consider a program with multiple phases, it is unlikely that a single thread affinity produces the best program performance for all of these phases. In the case of OpenMP, applications may have multiple parallel regions, each with a distinct inter-thread data sharing pattern. In this paper, we propose an approach that changes thread affinity dynamically (through thread migrations) between parallel regions at runtime to account for these distinct inter-thread data sharing patterns. We demonstrate that, as far as cache sharing is concerned for SPEC OMP01, not all the tested OpenMP applications exhibit distinct phase behavior. However, we show that while fixing thread affinity for the whole execution may improve performance by up to 30%, allowing dynamic thread pinning may improve performance by up to 40%. Furthermore, we provide an analysis of the conditions required to improve the effectiveness of the approach.
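A minimal sketch of the mechanism (per-region repinning, not the paper's system): each thread resets its own affinity at the top of every parallel region, using a hypothetical placement table region_core() that maps a (region, thread) pair to a core.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <omp.h>

    int region_core(int region, int tid);   /* hypothetical placement table */

    static void repin(int region)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(region_core(region, omp_get_thread_num()), &set);
        sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = calling thread */
    }

    void run_phases(void)
    {
        #pragma omp parallel
        {
            repin(0);
            /* ... region 0: one inter-thread sharing pattern ... */
        }
        #pragma omp parallel
        {
            repin(1);                       /* threads migrate for the new pattern */
            /* ... region 1: a different sharing pattern ... */
        }
    }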
Adaptive, efficient, parallel execution of parallel programs
- In PLDI
, 2014
"... Abstract Future multicore processors will be heterogeneous, be increasingly less reliable, and operate in dynamically changing operating conditions. Such environments will result in a constantly varying pool of hardware resources which can greatly complicate the task of efficiently exposing a progr ..."
Cited by 1 (0 self)
Abstract Future multicore processors will be heterogeneous, increasingly less reliable, and will operate under dynamically changing conditions. Such environments will result in a constantly varying pool of hardware resources, which can greatly complicate the task of efficiently mapping a program's parallelism onto those resources. Coupled with this uncertainty is the diverse set of efficiency metrics that users may desire. This paper proposes Varuna, a system that dynamically, continuously, rapidly, and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics. Varuna is applicable to both multithreaded and task-based programs and can be seamlessly inserted between the program and the operating system without changing the source code of either. We demonstrate Varuna's effectiveness in diverse execution environments using unaltered C/C++ parallel programs from various benchmark suites. Regardless of the execution environment, Varuna always outperformed the state-of-the-art approaches for the efficiency metrics considered.
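One ingredient of such adaptation, a worker pool that grows or shrinks without creating or destroying threads, can be sketched as follows. This is our sketch, not Varuna's implementation; pop_task() and the monitor that rewrites target_workers are hypothetical.

    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>

    atomic_int target_workers = 8;          /* rewritten at runtime by a monitor */

    void *pop_task(void);                   /* hypothetical: next task, or NULL */

    static void *worker(void *arg)
    {
        int rank = (int)(long)arg;
        for (;;) {
            /* Workers ranked above the target park instead of competing for
             * cores, so effective parallelism tracks the target without
             * creating or destroying threads. */
            while (rank >= atomic_load(&target_workers))
                sched_yield();
            void *task = pop_task();
            if (!task)
                return 0;
            /* ... execute task ... */
        }
    }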
Dynamic Adaptive Resource Management in a Virtualized NUMA Multicore System for Optimizing Power, Energy, and Performance
, 2013
"... A Non-Uniform Memory-Access (NUMA) machine is currently the most deployed type of a hardware architecture for high performance computing. System virtualization on the other hand is increasingly adopted for various reasons. In a virtualized NUMA system, the NUMA attributes are transparent to guest OS ..."
A Non-Uniform Memory-Access (NUMA) machine is currently the most widely deployed hardware architecture for high performance computing. System virtualization, on the other hand, is increasingly adopted for various reasons. In a virtualized NUMA system, the NUMA attributes are transparent to guest OSes, so a Virtual Machine Monitor (VMM) is required to perform NUMA-aware resource management. Tradeoffs between performance, power, and energy are observable as virtual cores (vcores) and/or virtual addresses are mapped in different ways. For example, sparsely located vcores have an advantage in memory caching compared to densely located vcores; on the other hand, densely located vcores tend to save power. Such tradeoffs lead to an abstract question: how can a VMM, as a resource manager, optimally or near-optimally execute guests on a NUMA architecture? In my dissertation, I claim that it is possible to solve this problem in real time through a dynamic adaptive system. Workload-aware scheduling, mapping, and shared resource management are controlled by adaptive schemes. The user may demand one of three objectives: performance, energy, or power. My system also incorporates a new detection framework that observes shared memory access behaviors with minimal overheads.
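The dense-versus-sparse placement tradeoff mentioned above can be made concrete with two mapping functions. This is a sketch under the assumption that physical cores are numbered contiguously within each NUMA node; it is not the dissertation's VMM code.

    /* Packed placement: consecutive vcores fill one node before the next,
     * keeping fewer nodes active (power-friendly). */
    static int dense_core(int vcore)
    {
        return vcore;
    }

    /* Spread placement: vcores round-robin across nodes, gaining aggregate
     * last-level cache at the cost of keeping more nodes awake. */
    static int sparse_core(int vcore, int nodes, int cores_per_node)
    {
        int node = vcore % nodes;           /* which node */
        int slot = vcore / nodes;           /* position within that node */
        return node * cores_per_node + slot;
    }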
Runtime Support For Maximizing Performance on Multicore Systems
, 2012
"... First and foremost I would like to sincerely thank my advisor, Dr. Rajiv Gupta, who was always there for me and shaped my research in many ways. His enthusiasm in research and hard working nature were instrumental in enabling my research to make the progress which it has made. I am particularly grat ..."
First and foremost I would like to sincerely thank my advisor, Dr. Rajiv Gupta, who was always there for me and shaped my research in many ways. His enthusiasm for research and hard-working nature were instrumental in enabling my research to make the progress it has. I am particularly grateful for all the freedom he gave me in selecting research problems and for his seemingly never-ending trust in my potential. Next, I would like to thank the members of my dissertation committee, Dr. Laxmi N. Bhuyan and Dr. Walid Najjar, for reviewing this dissertation. Their extensive and constructive comments have been very helpful in improving it. I was fortunate enough to do various internships during the course of my Ph.D.
The Road to Parallelism Leads Through Sequential Programming
"... Multicore processors are already ubiquitous and are now the targets of hundreds of thousands of applications. Due to a variety of reasons parallel programming has not been widely adopted to program even current homogeneous, known-resource multicore processors. Future multicore processors will be het ..."
Multicore processors are already ubiquitous and are now the targets of hundreds of thousands of applications. For a variety of reasons, parallel programming has not been widely adopted to program even current homogeneous, known-resource multicore processors. Future multicore processors will be heterogeneous, increasingly less reliable, and will operate in environments with dynamically changing conditions. With energy efficiency as a primary goal, they will present even more parallel execution challenges to the masses of unsophisticated programmers. Rather than attempt to make parallel programming more practical by chipping away at the multitude of its known drawbacks, we argue that sequential programs and their dynamic parallel execution are a better model. The paper outlines how to achieve: (i) dynamic parallel execution from a suitably written, statically sequential program; (ii) energy-efficient execution by dynamically and continuously controlling the parallelism; and (iii) low-overhead, precise-restartable parallel execution.
Bruce R. Childers
"... Withtheshifttomany-corechipmultiprocessors(CMPs),acritical issue is how to effectively coordinate and manage the execution of applications and hardware resources to overcome performance, power consumption, and reliability challenges stemming from hardwareandapplicationvariationsinherentinthisnewcomp ..."
With the shift to many-core chip multiprocessors (CMPs), a critical issue is how to effectively coordinate and manage the execution of applications and hardware resources to overcome the performance, power consumption, and reliability challenges stemming from the hardware and application variations inherent in this new computing environment. Effective resource and application management on CMPs requires consideration of user/application/hardware-specific requirements and dynamic adaptation of management decisions based on the actual run-time environment. However, designing an algorithm to manage resources and applications that can dynamically adapt based on the run-time environment is difficult, because most resource and application management and monitoring facilities are only available at the operating system level. This paper presents REEact, an infrastructure that provides the capability to ...