Results 1 - 10
of
12
Merge: A Programming Model for Heterogeneous Multi-core Systems Abstract
"... In this paper we propose the Merge framework, a general purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribu ..."
Abstract
-
Cited by 81 (1 self)
- Add to MetaCart
In this paper we propose the Merge framework, a general purpose programming model for heterogeneous multi-core systems. The Merge framework replaces current ad hoc approaches to parallel programming on heterogeneous platforms with a rigorous, library-based methodology that can automatically distribute computation across heterogeneous cores to achieve increased energy and performance efficiency. The Merge framework provides (1) a predicate dispatch-based library system for managing and invoking function variants for multiple architectures; (2) a high-level, library-oriented parallel language based on map-reduce; and (3) a compiler and runtime which implement the map-reduce language pattern by dynamically selecting the best available function implementations for a given input and machine configuration. Using a generic sequencer architecture interface for heterogeneous accelerators, the Merge framework can integrate function variants for specialized accelerators, offering the potential for to-the-metal performance for a wide range of heterogeneous architectures, all transparent to the user. The Merge framework has been prototyped on a heterogeneous platform consisting of an Intel Core 2 Duo CPU and an 8-core 32-thread Intel Graphics and Media Accelerator X3000, and a homogeneous 32-way Unisys SMP system with Intel Xeon processors. We implemented a set of benchmarks using the Merge framework and enhanced the library with X3000 specific implementations, achieving speedups of 3.6x – 8.5x using the X3000 and 5.2x – 22x using the 32-way system relative to the straight C reference implementation on a single IA32 core.
Carbon: architectural support for fine-grained parallelism on chip multiprocessors
- In ISCA ’07: Proceedings of the 34th annual international symposium on Computer architecture
, 2007
"... ABSTRACT Chip multiprocessors (CMPs) are now commonplace, and the num-ber of cores on a CMP is likely to grow steadily. However, in order ..."
Abstract
-
Cited by 67 (8 self)
- Add to MetaCart
(Show Context)
ABSTRACT Chip multiprocessors (CMPs) are now commonplace, and the num-ber of cores on a CMP is likely to grow steadily. However, in order
Efficient Operating System Scheduling for Performance-Asymmetric MultiCore Architectures
- in SC ’07
, 2007
"... Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures support single-threaded performance and multithreaded throughput at lower costs (e.g., die size and power). However, they also pose unique challenges t ..."
Abstract
-
Cited by 66 (3 self)
- Add to MetaCart
(Show Context)
Recent research advocates asymmetric multi-core architectures, where cores in the same processor can have different performance. These architectures support single-threaded performance and multithreaded throughput at lower costs (e.g., die size and power). However, they also pose unique challenges to operating systems, which traditionally assume homogeneous hardware. This paper presents AMPS, an operating system scheduler that efficiently supports both SMPand NUMA-style performance-asymmetric architectures. AMPS contains three components: asymmetry-aware load balancing, fastercore-first scheduling, and NUMA-aware migration. We have implemented AMPS in Linux kernel 2.6.16 and used CPU clock modulation to emulate performance asymmetry on an SMP and NUMA system. For various workloads, we show that AMPS achieves a median speedup of 1.16 with a maximum of 1.44 over stock Linux on the SMP, and a median of 1.07 with a maximum of 2.61 on the NUMA system. Our results also show that AMPS improves fairness and repeatability of application performance measurements. 1.
EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system
- In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation
, 2007
"... Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multi-core platform, since these specialized accelerators feature ..."
Abstract
-
Cited by 41 (2 self)
- Add to MetaCart
(Show Context)
Future mainstream microprocessors will likely integrate specialized accelerators, such as GPUs, onto a single die to achieve better performance and power efficiency. However, it remains a keen challenge to program such a heterogeneous multi-core platform, since these specialized accelerators feature ISAs and functionality that are significantly different from the general purpose CPU cores. In this paper, we present EXOCHI: (1) Exoskeleton Sequencer (EXO), an architecture to represent heterogeneous accelerators as ISA-based MIMD architecture resources, and a shared virtual memory heterogeneous multithreaded program execution model that tightly couples specialized accelerator cores with general purpose CPU cores, and (2) C for Heterogeneous Integration (CHI), an integrated C/C++ programming environment that supports accelerator-specific inline assembly and domain-specific languages.
Operating system support for overlapping-isa heterogeneous multi-core architectures
- in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on
, 2010
"... A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a de-sign provides a cost-effective solution for processor man-ufacturers to continuously improve both single-thread per-formance and multi-thread throughput. This design, how-ever, faces significa ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
(Show Context)
A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a de-sign provides a cost-effective solution for processor man-ufacturers to continuously improve both single-thread per-formance and multi-thread throughput. This design, how-ever, faces significant challenges in the operating system, which traditionally assumes only homogeneous hardware. This paper presents a comprehensive study of OS support for heterogeneous architectures in which cores have asym-metric performance and overlapping, but non-identical in-struction sets. Our algorithms allow applications to trans-parently execute and fairly share different types of cores. We have implemented these algorithms in the Linux 2.6.24 kernel and evaluated them on an actual heterogeneous plat-form. Evaluation results demonstrate that our designs effi-ciently manage heterogeneous hardware and enable signifi-cant performance improvements for a range of applications. 1
Implicitly parallel programming models for thousandcore microprocessors. In
- DAC,
, 2007
"... ..."
(Show Context)
ATLAS: A Chip-Multiprocessor with Transactional Memory Support
"... Chip-multiprocessors are quickly becoming popular in embedded systems. However, the practical success of CMPs strongly depends on addressing the difficulty of multithreaded application development for such systems. Transactional Memory (T M) promises to simplify concurrency management in multithread ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
(Show Context)
Chip-multiprocessors are quickly becoming popular in embedded systems. However, the practical success of CMPs strongly depends on addressing the difficulty of multithreaded application development for such systems. Transactional Memory (T M) promises to simplify concurrency management in multithreaded applications by allowing programmers to specify coarse-grain parallel tasks, while achieving performance comparable to fine-grain lock-based applications. This paper presents AT LAS, the first prototype of a CMP with hardware support for transactional memory. AT-LAS includes 8 embedded PowerPC cores that access coherent shared memory in a transactional manner. The data cache for each core is modified to support the speculative buffering and conflict detection necessary for transactional execution. We have mapped ATLAS to the BEE2 multi-FPGA board to create a full-system prototype that operates at 100MHz, boots Linux, and provides significant performance and ease-of-use benefits for a range of parallel applications. Overall, the ATLAS prototype provides an excellent framework for further research on the software and hardware techniques necessary to deliver on the potential of transactional memory. 1
ABSTRACT A Practical FPGA-based Framework for Novel CMP Research
"... Chip-multiprocessors are quickly gaining momentum in all segments of computing. However, the practical success of CMPs strongly depends on addressing the difficulty of multithreaded application development. To address this challenge, it is necessary to co-develop new CMP architecture with novel prog ..."
Abstract
- Add to MetaCart
(Show Context)
Chip-multiprocessors are quickly gaining momentum in all segments of computing. However, the practical success of CMPs strongly depends on addressing the difficulty of multithreaded application development. To address this challenge, it is necessary to co-develop new CMP architecture with novel programming models. Currently, architecture research relies on software simulators which are too slow to facilitate interesting experiments with CMP software without using small datasets or significantly reducing the level of detail in the simulated models. An alternative to simulation is to exploit the rich capabilities of modern FPGAs to create FPGA-based platforms for novel CMP research. This paper presents ATLAS, the first prototype for CMPs with hardware support for Transactional Memory (TM), a technology aiming to simplify parallel programming. ATLAS uses the BEE2 multi-FPGA board to provide a system with 8 PowerPC cores that run at 100MHz and runs Linux. ATLAS provides significant benefits for CMP research such as 100x performance improvement over a software simulator and good visibility that helps with software tuning and architectural improvements. In addition to presenting and evaluating ATLAS, we share our observations about building a FPGA-based framework for CMP research. Specifically, we address issues such as overall performance, challenges of mapping ASIC-style CMP RTL on to FPGAs, software support, the selection criteria for the base processor, and the challenges of using pre-designed IP libraries.
Implicitly Parallel Programming Models for Thousand-Core Microprocessors
"... This paper argues for an implicitly parallel programming model for many-core microprocessors, and provides initial technical approaches towards this goal. In an implicitly par-allel programming model, programmers maximize algorithm-level parallelism, express their parallel algorithms by assert-ing h ..."
Abstract
- Add to MetaCart
(Show Context)
This paper argues for an implicitly parallel programming model for many-core microprocessors, and provides initial technical approaches towards this goal. In an implicitly par-allel programming model, programmers maximize algorithm-level parallelism, express their parallel algorithms by assert-ing high-level properties on top of a traditional sequential programming language, and rely on parallelizing compilers and hardware support to perform parallel execution under the hood. In such a model, compilers and related tools re-quire much more advanced program analysis capabilities and programmer assertions than what are currently available so that a comprehensive understanding of the input program’s concurrency can be derived. Such an understanding is then used to drive automatic or interactive parallel code genera-tion tools for a diverse set of parallel hardware organizations. The chip-level architecture and hardware should maintain parallel execution state in such a way that a strictly sequen-tial execution state can always be derived for the purpose of verifying and debugging the program. We argue that im-plicitly parallel programming models are critical for address-ing the software development crises and software scalability challenges for many-core microprocessors.