Results 1 - 10
of
966
The Chimera Reconfigurable Functional Unit
, 2004
"... By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host proce ..."
Abstract
-
Cited by 190 (19 self)
- Add to MetaCart
By strictly separating reconfigurable logic from the host processor, current custom computing systems suffer from a significant communication bottleneck. In this paper, we describe Chimaera, a system that overcomes the communication bottleneck by integrating reconfigurable logic into the host processor itself. With direct access to the host processor’s register file, the system enables the creation of multi-operand instructions and a speculative execution model key to high-performance, general-purpose reconfigurable computing. Chimaera also supports multi-output functions and utilizes partial run-time reconfiguration to reduce reconfiguration time. Combined, the system can provide speedups of a factor of two or more for general-purpose computing, and speedups of 160 or more are possible for hand-mapped applications.
NanoFabrics: Spatial Computing Using Molecular Electronics
"... The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising solution to these problems is offered by a ..."
Abstract
-
Cited by 164 (12 self)
- Add to MetaCart
(Show Context)
The continuation of the remarkable exponential increases in processing power over the recent past faces imminent challenges due in part to the physics of deep-submicron CMOS devices and the costs of both chip masks and future fabrication plants. A promising solution to these problems is offered by an alternative to CMOS-based computing, chemically assembled electronic nanotechnology (CAEN). In this paper we outline how CAEN-based computing can become a reality. We briefly describe recent work in CAEN and how CAEN will affect computer architecture. We show how the inherently reconfigurable nature of CAEN devices can be exploited to provide high-density chips with defect tolerance at significantly reduced manufacturing costs. We develop a layered abstract architecture for CAEN-based computing devices and we present preliminary results which indicate that such devices will be competitive with CMOS circuits.
A Chip-Multiprocessor Architecture with Speculative Multithreading.
- IEEE Trans. on Computers,
, 1999
"... ..."
(Show Context)
3d-stacked memory architectures for multi-core processors
- In International Symposium on Computer Architecture
"... Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In ..."
Abstract
-
Cited by 132 (7 self)
- Add to MetaCart
(Show Context)
Three-dimensional integration enables stacking memory directly on top of a microprocessor, thereby significantly reducing wire delay between the two. Previous studies have examined the performance benefits of such an approach, but all of these works only consider commodity 2D DRAM organizations. In this work, we explore more aggressive 3D DRAM organizations that make better use of the additional die-to-die bandwidth provided by 3D stacking, as well as the additional transistor count. Our simulation results show that with a few simple changes to the 3D-DRAM organization, we can achieve a 1.75 × speedup over previously proposed 3D-DRAM approaches on our memoryintensive multi-programmed workloads on a quad-core processor. The significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck, which we address by combining a novel data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning. Our scalable L2 MHA yields an additional 17.8 % performance improvement over our 3D-stacked memory architecture. 1.
Automatic Application-Specific Instruction-Set Extensions Under Microarchitectural Constraints
, 2003
"... Many commercial processors now offer the possibility of extending their instruction set for a specific application---that is, to introduce customised functional units. There is a need to develop algorithms that decide automatically, from highlevel application code, which operations are to be carried ..."
Abstract
-
Cited by 126 (31 self)
- Add to MetaCart
Many commercial processors now offer the possibility of extending their instruction set for a specific application---that is, to introduce customised functional units. There is a need to develop algorithms that decide automatically, from highlevel application code, which operations are to be carried out in the customised extensions. A few algorithms exist but are severely limited in the type of operation clusters they can choose and hence reduce significantly the effectiveness of specialisation. In this paper we introduce a more general algorithm which selects maximal-speedup convex subgraphs of the application dataflow graph under fundamental microarchitectural constraints, and which improves significantly on the state of the art.
CommBench - A Telecommunications Benchmark for Network Processors
- IN PROC. OF IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS
, 2000
"... This paper presents a benchmark, CommBench, for use in evaluating and designing telecommunications network processors. The benchmark applications focus on small, computationally intense program kernels typical of the network processor environment. The benchmark ..."
Abstract
-
Cited by 125 (20 self)
- Add to MetaCart
This paper presents a benchmark, CommBench, for use in evaluating and designing telecommunications network processors. The benchmark applications focus on small, computationally intense program kernels typical of the network processor environment. The benchmark
On-line Scheduling of Hard Real-Time Tasks on Variable Voltage Processor
, 1998
"... We consider the problem of scheduling the mixed workload of both sporadic (on-line) and periodic (off-line) tasks on variable voltage processor to optimize power consumption while ensuring that all periodic tasks meet their deadlines and to accept as many sporadic tasks, which can be guaranteed to m ..."
Abstract
-
Cited by 121 (7 self)
- Add to MetaCart
(Show Context)
We consider the problem of scheduling the mixed workload of both sporadic (on-line) and periodic (off-line) tasks on variable voltage processor to optimize power consumption while ensuring that all periodic tasks meet their deadlines and to accept as many sporadic tasks, which can be guaranteed to meet their deadlines, as possible. The proposed efficient algorithms result in the scheduling solutions, which are very close to the minimum bound achievable with the dynamically variable voltage approach. The effectiveness of the proposed algorithms is shown on extensive experiments with real-life design examples. 1
A Design Space Evaluation of Grid Processor Architectures
, 2001
"... In this paper, we survey the design space of a new class of architec-tures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on trad ..."
Abstract
-
Cited by 118 (33 self)
- Add to MetaCart
In this paper, we survey the design space of a new class of architec-tures called Grid Processor Architectures (GPAs). These architectures are designed to scale with technology, allowing faster clock rates than conventional architectures while providing superior instruction-level parallelism on traditional workloads and high performance across a range of application classes. A GPA consists of an array of ALUs, each with limited control, connected by a thin operand network. Pro-grams are executed by mapping blocks of statically scheduled instruc-tions to the ALU array and executing them dynamically in dataflow or-der This organization enables the critical paths of instruction blocks to be executed on chains of ALUs without transmitting temporary val-ues back to the register file, avoiding most of the large, unscalable structures that limit the scalability of conventional architectures. Fi-nally, we present simulation results of a preliminary design, the GPA-1. With a half-cycle routing delay, we obtain performance roughly equal to an ideal 8-way, 512-entry window superscalar core. With no inter-ALU delay, perfect memory, and perfect branch prediction, the 1PC of the GPA-1 is more than twice that of the ideal superscalar core, achieving an average of 11 IPC across nine SPEC CPU2000 and Mediabench benchmarks.
Meta Optimization: Improving Compiler Heuristics with Machine Learning
- In Proceedings of the ACM SIGPLAN ’03 Conference on Programming Language Design and Implementation
, 2002
"... Compiler writers have crafted many heuristics over the years to (approximately) solve NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process. This paper introduces Meta Optimization, a methodology for automatically fin ..."
Abstract
-
Cited by 116 (6 self)
- Add to MetaCart
Compiler writers have crafted many heuristics over the years to (approximately) solve NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process. This paper introduces Meta Optimization, a methodology for automatically fine-tuning compiler heuristics. Meta Optimization uses machine-learning techniques to automatically search an optimization's solution space. We implemented Meta Optimization on top of Trimaran [20] to test its efficacy. By `evolving' Trimaran's hyperblock selection optimization for a particular benchmark, our system achieves impressive speedups. Application-specific heuristics obtain an average speedup of 23% (up to 43%) for the applications in our suite. Furthermore, by evolving a compiler's heuristic over several benchmarks, we can create effective, general-purpose compilers. The best general-purpose heuristic our system found improved Trimaran's hyperblock selection algorithm by an average of 25% on our training set, and 9% on a completely unrelated test set. We further test the applicability of our system on Trimaran's priority-based coloring register allocator. For this well-studied optimization we were able to specialize the compiler for individual applications, achieving an average speedup of 6%.
NetBench: A benchmarking Suite for Network Processors
- In ICCAD
, 2001
"... Abstract — In this study we introduce NetBench, a benchmarking suite for network processors. NetBench contains a total of 9 applications that are representative of commercial applications for network processors. These applications are from all levels of packet processing; Small, low-level code fragm ..."
Abstract
-
Cited by 109 (11 self)
- Add to MetaCart
(Show Context)
Abstract — In this study we introduce NetBench, a benchmarking suite for network processors. NetBench contains a total of 9 applications that are representative of commercial applications for network processors. These applications are from all levels of packet processing; Small, low-level code fragments as well as large application level programs are included in the suite. Using SimpleScalar simulator we study the NetBench programs in detail and characterize the network processor workloads. We also compare key characteristics such as instructions per cycle, instruction distribution, branch prediction accuracy, and cache behavior with the programs from MediaBench. Although the aimed architectures are similar for MediaBench and NetBench suites, we show that these workloads have significantly different characteristics. Hence a separate benchmarking suite for network processors is a necessity. Finally, we present performance measurements from Intel IXP1200 Network Processor to show how NetBench can be utilized. 1.