Thread tailor: Dynamically weaving threads together for efficient, adaptive parallel applications (2010)

by J. Lee, H. Wu, M. Ravichandran, N. Clark
Venue: ISCA

Results 1 - 10 of 16

Thread reinforcer: Dynamically determining number of threads via OS-level monitoring

by Kishore Kumar Pusukuri, Rajiv Gupta, Laxmi N. Bhuyan - in Workload Characterization (IISWC), 2011 IEEE International Symposium on. IEEE, 2011
Abstract - Cited by 12 (3 self)
Abstract—It is often assumed that to maximize the performance of a multithreaded application, the number of threads created should equal the number of cores. While this may be true for systems with four or eight cores, it is not true for systems with a larger number of cores. Our experiments with PARSEC programs on a 24-core machine demonstrate this. Therefore, dynamically determining the appropriate number of threads for a multithreaded application is an important unsolved problem. In this paper we develop a simple technique for dynamically determining the appropriate number of threads without recompiling the application, using complex compilation techniques, or modifying Operating System policies. We first present a scalability study of eight programs from PARSEC conducted on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06, for thread counts ranging from a few threads to 128 threads. Our study shows that not only does the maximum speedup achieved by these programs vary widely (from 3.6x to 21.9x), the number of threads that produces the maximum speedup also varies widely (from 16 to 63 threads). By understanding the overall speedup behavior of these programs we identify the critical Operating System level factors that explain why the speedups vary with the number of threads. As an application of these observations, we develop a framework called “Thread Reinforcer” that dynamically monitors a program’s execution to search for the number of threads that are likely to yield the best speedups. Thread Reinforcer identifies the optimal or near-optimal number of threads for most of the PARSEC programs studied, as well as for SPEC OMP and PBZIP2 programs.
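The runtime search Thread Reinforcer performs can be pictured with a minimal sketch (hypothetical Python, not the authors' code, which relies on OS-level monitoring rather than raw trial timings): run short timed trials at candidate thread counts and keep the fastest.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_trial(task, num_threads, items_per_thread=4):
    """Time one short trial of the workload at a given thread count."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Hand each trial a fixed amount of work per thread.
        list(pool.map(task, range(num_threads * items_per_thread)))
    return time.perf_counter() - start

def find_thread_count(task, candidates=(1, 2, 4, 8)):
    """Return the candidate thread count whose trial ran fastest."""
    timings = {n: run_trial(task, n) for n in candidates}
    return min(timings, key=timings.get)
```

In the real system the search is guided by measured OS metrics (e.g. lock contention, migration rates) instead of brute-force timing, which keeps the monitoring overhead low.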

Citation Context

...finding a suitable number of threads for a multi-threaded application to optimize system resources in a multi-core environment is an important open problem. The existing dynamic compilation techniques [6] for finding an appropriate number of threads are quite complex. In [6], the authors noted that the Operating System (OS) and hardware likely cannot infer enough information about the application to make effe...

Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems

by Janghaeng Lee, Mehrzad Samadi, Yongjun Park, Scott Mahlke
Abstract - Cited by 3 (1 self)
Abstract—Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores, while the CPU handles non-data-parallel work, such as sequential code or data transfer management. Unfortunately, this work distribution can be a poor solution, as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance-competitive with GPUs on many workloads, so simply partitioning work based on fixed roles may be a poor choice. In this paper, we present the single kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs exhibit with respect to input size. On real hardware, SKMD achieves an average speedup of 29% on a system with one multicore CPU and two asymmetric GPUs, compared to a fastest-device execution strategy, for a set of popular OpenCL kernels. Index Terms—GPGPU, OpenCL, Collaboration, Data parallel
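SKMD's partitioning step can be illustrated with a small sketch (hypothetical; the real system splits OpenCL NDRanges and also models transfer and merge costs): divide an iteration space in proportion to each device's measured throughput.

```python
def partition_work(total_items, device_speeds):
    """Split `total_items` iterations proportionally to each device's
    relative throughput; any rounding residue goes to the fastest device."""
    total_speed = sum(device_speeds)
    shares = [total_items * s // total_speed for s in device_speeds]
    # Integer division can leave a few items unassigned.
    shares[device_speeds.index(max(device_speeds))] += total_items - sum(shares)
    return shares
```

For example, a CPU measured at speed 1 and a GPU at speed 3 would split 100 work items as 25 and 75; the framework would then launch the generated partial kernels over those sub-ranges.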

Citation Context

...to minimize the overall execution time by balancing workloads among several devices. This is an extension of the NP-hard bin packing problem [6] and a common problem in load balancing parallel systems [17]. The difference is that it involves more parameters, such as data transfer time between the host and devices and the cost of merging partial outputs. Most importantly, the performance of devices can ...

Performance evaluation and analysis of thread pinning strategies on multi-core platforms: Case study of SPEC OMP applications on Intel architectures

by Abdelhafid Mazouz, Sid-Ahmed-Ali Touati, Denis Barthou - in High Performance Computing and Simulation (HPCS), 2011 International Conference on. IEEE, 2011
Abstract - Cited by 3 (1 self)
ABSTRACT With the introduction of multi-core processors, thread affinity has quickly emerged as one of the most important factors for accelerating program execution times. This article presents a complete experimental study on the performance of various thread pinning strategies. We investigate four application-independent thread pinning strategies and five application-sensitive ones based on cache sharing. We performed extensive performance evaluations on three different multi-core machines reflecting three typical usage scenarios: a workstation, a server, and a high-performance machine. Overall, we show that fixing thread affinities (whatever the tested strategy) is a better choice for improving program performance on HPC ccNUMA machines compared to OS-based thread placement. This means that the current Linux OS scheduling strategy is not necessarily the best choice in terms of performance on ccNUMA machines, even if it is a good choice in terms of core usage ratio and work balancing. On smaller Core2 and Nehalem machines, we show that the benefit of thread pinning is not satisfactory in terms of speedups versus OS-based scheduling, but the performance stability is much better.
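Two classic application-independent pinning strategies of the kind studied here, compact and scatter placement, can be sketched as pure mapping functions from thread index to core index (a hypothetical illustration; the function names are ours, not the paper's):

```python
def compact_pinning(num_threads, cores_per_socket, num_sockets):
    """Compact strategy: fill the cores of one socket before the next,
    maximizing cache sharing between neighboring threads."""
    total_cores = cores_per_socket * num_sockets
    return [t % total_cores for t in range(num_threads)]

def scatter_pinning(num_threads, cores_per_socket, num_sockets):
    """Scatter strategy: round-robin threads across sockets,
    maximizing aggregate cache capacity and memory bandwidth."""
    return [(t % num_sockets) * cores_per_socket
            + (t // num_sockets) % cores_per_socket
            for t in range(num_threads)]
```

On Linux the resulting core list could be applied with `os.sched_setaffinity`; which strategy wins depends on the inter-thread sharing pattern, which is exactly what the application-sensitive strategies in the paper try to exploit.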

Citation Context

... guided method for partitioning memory accesses across distributed data caches. The difference with our work is that they focused on fine grain parallelism in single-threaded applications. Lee et al. [15] proposed a framework to automatically adjust the number of threads in an application to optimise system efficiency. The work assumes a uniform distribution of the data between threads. Kandemir et al...

When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency

by Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
Abstract - Cited by 2 (0 self)
While developing shared-memory programs, programmers often contend with the problem of how many threads to create for best efficiency. Creating as many threads as the number of available processor cores, or more, may not be the most efficient configuration. Too many threads can result in excessive contention for shared resources, wasting energy, which is of primary concern for embedded devices. Furthermore, thermal and power constraints prevent us from operating all the processor cores at the highest possible frequency, favoring fewer threads. The best number of threads to run depends on the application, user input, and hardware resources available. It can also change at runtime, making it infeasible for the programmer to determine this number. To address this problem, we propose LIMO, a runtime system that dynamically manages the number of running threads of an application to maximize performance and energy efficiency. LIMO monitors threads’ progress along with the usage of shared hardware resources to determine the best number of threads to run and the voltage and frequency level. With dynamic adaptation, LIMO provides an average of 21% performance improvement and a 2x improvement in energy efficiency on a 32-core system over the default configuration of 32 threads for a set of concurrent applications from the PARSEC suite, the Apache web server, and the Sphinx speech recognition system.
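The thread-count adaptation LIMO performs can be caricatured as a hill-climbing controller (a hypothetical sketch; the real system additionally tracks shared-resource usage and picks DVFS levels): grow the thread count while aggregate progress improves, back off once it degrades.

```python
def adjust_threads(current, progress_rate, prev_rate, step=1, max_threads=32):
    """One controller step: if measured progress improved since the last
    interval, add `step` threads (up to max_threads); otherwise assume
    contention and retire `step` threads (down to one)."""
    if progress_rate > prev_rate:
        return min(current + step, max_threads)
    return max(current - step, 1)
```

Repeated every monitoring interval, this converges toward the knee of the speedup curve rather than blindly running one thread per core.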

Citation Context

...number of threads equal to the number of available cores. To improve upon this scheme, previous work has proposed techniques that profile applications statically to choose an appropriate number of threads [17, 21, 22] to improve performance by reducing communication and contention for shared resources (however, they did not consider power constraints and DVFS, which would further favor running fewer threads, nor we...

Dynamic Thread Pinning for Phase-Based OpenMP Programs

by Abdelhafid Mazouz, Sid-Ahmed-Ali Touati, Denis Barthou, 2013
Abstract - Cited by 1 (0 self)
Abstract. Thread affinity has emerged as an important technique for improving overall program performance and performance stability. However, if we consider a program with multiple phases, it is unlikely that a single thread affinity produces the best program performance for all these phases. In the case of OpenMP, applications may have multiple parallel regions, each with a distinct inter-thread data sharing pattern. In this paper, we propose an approach that changes thread affinity dynamically (through thread migrations) between parallel regions at runtime to account for these distinct inter-thread data sharing patterns. We demonstrate that, as far as cache sharing is concerned for SPEC OMP01, not all the tested OpenMP applications exhibit a distinct phase behavior. However, we show that while fixing thread affinity for the whole execution may improve performance by up to 30%, allowing dynamic thread pinning may improve performance by up to 40%. Furthermore, we provide an analysis of the conditions required to improve the effectiveness of the approach.
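Choosing a distinct affinity per parallel region can be sketched as a lookup over profiled runtimes (a hypothetical illustration, not the paper's mechanism, which migrates threads at region boundaries at runtime):

```python
def best_affinity_per_region(profile):
    """profile: {region_name: {pinning_strategy: runtime_seconds}}.
    Pick the fastest pinning strategy independently for each
    parallel region, yielding a per-phase affinity plan."""
    return {region: min(times, key=times.get)
            for region, times in profile.items()}
```

A runtime would consult this plan on entry to each OpenMP parallel region and migrate threads only when the chosen strategy differs from the current one, since migrations themselves cost cache warm-up time.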

Citation Context

...real cache activity. Some studies have addressed the data cache sharing at the compiler level. They focused on improving the data locality in multicores based on the architecture topology. Lee et al. [7] proposed a framework to automatically adjust the number of threads in an application to optimize system efficiency. The work assumes a uniform distribution of the data between threads. Kandemir et al...

Adaptive, efficient, parallel execution of parallel programs

by Srinath Sridharan, Gagan Gupta, Gurindar S. Sohi - In PLDI, 2014
Abstract - Cited by 1 (0 self)
Abstract Future multicore processors will be heterogeneous, be increasingly less reliable, and operate in dynamically changing operating conditions. Such environments will result in a constantly varying pool of hardware resources which can greatly complicate the task of efficiently exposing a program's parallelism onto these resources. Coupled with this uncertainty is the diverse set of efficiency metrics that users may desire. This paper proposes Varuna, a system that dynamically, continuously, rapidly and transparently adapts a program's parallelism to best match the instantaneous capabilities of the hardware resources while satisfying different efficiency metrics. Varuna is applicable to both multithreaded and task-based programs and can be seamlessly inserted between the program and the operating system without needing to change the source code of either. We demonstrate Varuna's effectiveness in diverse execution environments using unaltered C/C++ parallel programs from various benchmark suites. Regardless of the execution environment, Varuna always outperformed the state-of-the-art approaches for the efficiency metrics considered.

Dynamic Adaptive Resource Management in a Virtualized NUMA Multicore System for Optimizing Power, Energy, and Performance

by Chang Seok Bae, 2013
Abstract
A Non-Uniform Memory-Access (NUMA) machine is currently the most widely deployed hardware architecture for high performance computing. System virtualization, on the other hand, is increasingly adopted for various reasons. In a virtualized NUMA system, the NUMA attributes are transparent to guest OSes. Thus, a Virtual Machine Monitor (VMM) is required to have NUMA-aware resource management. Tradeoffs between performance, power, and energy are observable as virtual cores (vcores) and/or virtual addresses are mapped in different ways. For example, sparsely located vcores have an advantage in memory caching compared to densely located vcores. On the other hand, densely located vcores tend to save power. Such tradeoffs lead to an abstract question: how can a VMM, as a resource manager, optimally or near-optimally execute guests under a NUMA architecture? In my dissertation, I claim that it is possible to solve this problem in real time through a dynamic adaptive system. Workload-aware scheduling, mapping, and shared resource management are controlled by adaptive schemes. The user may demand one of three objectives: performance, energy, or power. My system also incorporates a new detection framework that observes shared memory access behaviors with minimal overheads,

Citation Context

...ve workloads. They use queuing theory to analyze the performance in terms of response time and propose dynamic control schemes using either fluid control or stochastic dynamic programming. Lee et al. [79] suggest an offline static analysis technique to construct a communication graph; then, dynamic compilation selects and combines some threads to reduce unnecessary synchronization cost. Energy, power,...

Runtime Support For Maximizing Performance on Multicore Systems

by Kishore Kumar Pusukuri, 2012
Abstract
First and foremost I would like to sincerely thank my advisor, Dr. Rajiv Gupta, who was always there for me and shaped my research in many ways. His enthusiasm in research and hard working nature were instrumental in enabling my research to make the progress which it has made. I am particularly grateful for all the freedom he gave me in selecting research problems and his seemingly never-ending trust in my potential. Next, I would like to thank the members of my dissertation committee, Dr. Laxmi N. Bhuyan and Dr. Walid Najjar for reviewing this dissertation. Their extensive and constructive comments have been very helpful in improving this dissertation. I was fortunate enough to do various internships during the course of my Ph.D.

Citation Context

...suitable number of threads for a multithreaded application to optimize the use of system resources in a multicore environment is an important problem. 7.1.2 Dynamically Determining Number of Threads In [27], Lee et al. show how to adjust the number of threads in an application dynamically to optimize system efficiency. They develop a runtime system called “Thread Tailor” which uses dynamic compilation to co...
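The combining step that Thread Tailor performs, as the snippets above describe it, can be caricatured as greedy merging over a communication graph (a hypothetical sketch; the actual system uses dynamic compilation to weave the chosen threads together):

```python
def weave_threads(num_threads, comm_edges, target):
    """Greedily merge the pair of thread groups joined by the heaviest
    communication edge until only `target` groups remain. `comm_edges`
    maps a (thread_a, thread_b) pair to the volume of data exchanged;
    threads merged into one group need no synchronization between
    themselves."""
    groups = [{t} for t in range(num_threads)]

    def group_of(thread):
        return next(g for g in groups if thread in g)

    while len(groups) > target:
        # Consider only edges whose endpoints are still in different groups.
        cross = [(w, a, b) for (a, b), w in comm_edges.items()
                 if group_of(a) is not group_of(b)]
        if not cross:
            break  # remaining threads do not communicate
        _, a, b = max(cross)
        ga, gb = group_of(a), group_of(b)
        groups.remove(gb)
        ga |= gb
    return groups
```

For instance, with heavy edges (0,1) and (2,3) and a light edge (1,2), reducing four threads to two yields the groups {0,1} and {2,3}, collapsing the chattiest pairs first.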

The Road to Parallelism Leads Through Sequential Programming

by Gagan Gupta, Srinath Sridharan, Gurindar S. Sohi
Abstract
Multicore processors are already ubiquitous and are now the targets of hundreds of thousands of applications. Due to a variety of reasons, parallel programming has not been widely adopted to program even current homogeneous, known-resource multicore processors. Future multicore processors will be heterogeneous, be increasingly less reliable, and operate in environments with dynamically changing operating conditions. With energy efficiency as a primary goal, they will present even more parallel execution challenges to the masses of unsophisticated programmers. Rather than attempt to make parallel programming more practical by chipping away at the multitude of its known drawbacks, we argue that sequential programs and their dynamic parallel execution are a better model. The paper outlines how to achieve: (i) dynamic parallel execution from a suitably-written statically sequential program, (ii) energy-efficient execution by dynamically and continuously controlling the parallelism, and (iii) a low-overhead, precise-restartable parallel execution.

Citation Context

...the programmer to prepare the application ahead of time to facilitate the process of adaptation in the execution environment. Several recent papers propose to dynamically vary the degree of parallelism [15, 16, 26, 28, 29, 36, 40] without any programmer involvement, but they all have several drawbacks. To summarize, they require offline analysis and learning with hints from the compiler, employ metrics and mechanisms that are ...

Bruce R. Childers

by Ryan W. Moore, Jack W. Davidson, Mary Lou Soffa
Abstract
With the shift to many-core chip multiprocessors (CMPs), a critical issue is how to effectively coordinate and manage the execution of applications and hardware resources to overcome performance, power consumption, and reliability challenges stemming from hardware and application variations inherent in this new computing environment. Effective resource and application management on CMPs requires consideration of user/application/hardware-specific requirements and dynamic adaptation of management decisions based on the actual run-time environment. However, designing an algorithm to manage resources and applications that can dynamically adapt based on the run-time environment is difficult because most resource and application management and monitoring facilities are only available at the operating system level. This paper presents REEact, an infrastructure that provides the capability to

Citation Context

...from [25] in that applications have their own policy (in the LEM) and are able to choose how to respond, instead of being limited to the Intel task scheduler’s available policy/policies. Works such as [21] are orthogonal to our approach, and might allow for the automatic creation of LEM policies. Java’s thread pool preallocates a pool of worker threads [1]. An application can then use these pre-created...

