DAISY: Dynamic Compilation for 100% Architectural Compatibility
, 1997
Cited by 173 (12 self)
Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit
- IN PROCEEDINGS OF THE 27TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 2000
Cited by 76 (1 self)
Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically-scheduled superscalar processor. Chimaera is capable of performing 9-input/1-output operations on integer data. We discuss the Chimaera C compiler that automatically maps computations for execution in the RFU. Chimaera is capable of: (1) collapsing a set of instructions into RFU operations, (2) converting control-flow into RFU operations, and (3) supporting a more powerful fine-grain data-parallel model than that supported by current multimedia extension instruction sets (for integer operations). Using a set of multimedia and communication applications we show that even with simple optimizations, the Chimaera C compiler is able to map 22% of all instructions to the RFU on average. A variety of computations are mapped into RFU operations ranging from as simple as add/sub-shift pairs to operations of more than 10 instructions including several branches. Timing experiments demonstrate that for a 4-way out-of-order superscalar processor Chimaera results in average performance improvements of 21%, assuming a very aggressive core processor design (most pessimistic RFU latency model) and communication overheads from and to the RFU.
Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine
- In Proceedings of the Eighth ACM Conference on Architectural Support for Programming Languages and Operating Systems
, 1998
Cited by 71 (15 self)
Increasing demand for both greater parallelism and faster clocks dictates that future generation architectures will need to decentralize their resources and eliminate primitives that require single cycle global communication. A Raw microprocessor distributes all of its resources, including instruction streams, register files, memory ports, and ALUs, over a pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler. Because communication in Raw machines is distributed, compiling for instruction-level parallelism (ILP) requires both spatial instruction partitioning and traditional temporal instruction scheduling. In addition, the compiler must explicitly manage all communication through the interconnect, including the global synchronization required at branch points. This paper describes RAWCC, the compiler we have developed for compiling general-purpose sequential programs to the distributed Raw architecture. We present performance results that demonstrate that, although Raw machines provide no mechanisms for global communication, the Raw compiler can schedule to achieve speedups that scale with the number of available functional units.
A Comparison of Full and Partial Predicated Execution Support for ILP Processors
- IN PROCEEDINGS OF THE 22ND INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE
, 1995
Cited by 58 (9 self)
One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support predicated execution can be difficult. On one end of the design spectrum, architectural support for full predicated execution requires increasing the number of source operands for all instructions. Full predicate support provides for the most flexibility and the largest potential performance improvements. On the other end, partial predicated execution support, such as conditional moves, requires very little change to existing architectures. This paper presents a preliminary study to qualitatively and quantitatively address the benefit of full and partial predicated execution support. With our current compiler technology, we show that the compiler can use both partial and full predication to achieve speedup in large control-intensive programs. Some details of the code generation techniques are shown to provide insight into the benefit of going from partial to full predication. Preliminary experimental results are very encouraging: partial predication provides an average of 33% performance improvement for an 8-issue processor with no predicate support while full predication provides an additional 30% improvement.
HPL-PD architecture specification: Version 1.1
, 2000
Cited by 52 (6 self)
Keywords: instruction-level parallelism, parametric architecture, EPIC, VLIW, superscalar, speculative execution, predicated execution, programmatic cache control, run-time memory disambiguation, branch architecture
HPL-PD is a parametric processor architecture conceived for research in instruction-level parallelism (ILP). Its main purpose is to serve as a vehicle to investigate processor architectures having significant parallelism and to investigate the compiler technology needed to effectively exploit such architectures. The architecture is parametric in that it admits machines of different composition and scale, especially with respect to the nature and amount of parallelism offered. The architecture admits EPIC, VLIW and superscalar implementations so as to provide a basis for understanding the merits and demerits of these different styles of implementation. This report describes those parts of the architecture that are common to all machines in the family. It introduces the basic concepts such as the structure of an instruction, instruction execution semantics, the types of register files, etc. and describes the semantics of the operation repertoire.
Characterizing the Impact of Predicated Execution on Branch Prediction
, 1994
Cited by 47 (9 self)
Branch instructions are recognized as a major impediment to exploiting instruction level parallelism. Even with sophisticated branch prediction techniques, many frequently executed branches remain difficult to predict. An architecture supporting predicated execution may allow the compiler to remove many of these hard-to-predict branches, reducing the number of branch mispredictions and thereby improving performance. We present an in-depth analysis of the characteristics of those branches which are frequently mispredicted and examine the effectiveness of an advanced compiler to eliminate these branches. Over the benchmarks studied, an average of 27% of the dynamic branches and 56% of the dynamic branch mispredictions are eliminated with predicated execution support.
A Framework for Balancing Control Flow and Predication
- IN PROCEEDINGS OF THE 30TH INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE
, 1997
Cited by 47 (4 self)
Predicated execution is a promising architectural feature for exploiting instruction-level parallelism in the presence of control flow. Compiling for predicated execution involves converting program control flow into conditional, or predicated, instructions. This process is known as if-conversion. In order to effectively apply if-conversion, one must address two major issues: what should be if-converted and when the if-conversion should be applied. A compiler's use of predication as a representation is most effective when large amounts of code are if-converted and if-conversion is performed early in the compilation procedure. On the other hand, the final code generated for a processor with predicated execution requires a delicate balance between control flow and predication to achieve efficient execution. The appropriate balance is tightly coupled with scheduling decisions and detailed processor characteristics. This paper presents an effective compilation framework that allows the compiler to maximize the benefits of predication as a compiler representation while delaying the final balancing of control flow and predication to schedule time.
LISA - Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures
, 1999
Cited by 44 (5 self)
This paper presents the machine description language LISA for the generation of bit- and cycle-accurate models of DSP processors. Based on a behavioral operation description, the architectural details and pipeline operations of modern DSP processors can be covered. Beyond the behavioral model, LISA descriptions include other architecture-related information like the instruction set. The information provided by LISA models enables automatic generation of simulators and assemblers, which are essential elements of DSP software development environments. To prove the applicability of our approach, a realized model of the Texas Instruments TMS320C6201 DSP is presented and derived LISA code examples are given.
Dynamic Rescheduling: A Technique for Object Code Compatibility in VLIW Architectures
, 1995
Cited by 43 (12 self)
Lack of object code compatibility in VLIW architectures is a severe limit to their adoption as a general-purpose computing paradigm. Previous approaches include hardware and software techniques, both of which have drawbacks. Hardware techniques add to the complexity of the architecture, whereas software techniques require multiple executables. This paper presents a technique called Dynamic Rescheduling that applies software techniques dynamically, using intervention by the operating system. Results are presented to demonstrate the viability of the technique using the Illinois IMPACT compiler and the TINKER architectural framework.
1 Introduction
Lack of object-code compatibility across generations of a VLIW architecture is an often raised objection to its use as a general-purpose computing paradigm [1]. A program binary compiled for VLIW generation x cannot be guaranteed to execute correctly on generations x + n or x - n, for a reasonable value of n. This means that an installe...
Load-reuse analysis: Design and evaluation
- IN PROCEEDINGS OF THE ACM SIGPLAN ’99 CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION
, 1999

