Scalable Custom Instructions Identification for Instruction-Set Extensible Processors
- In CASES, 2004
"... Extensible processors allow addition of application-specific custom instructions to the core instruction set architecture. However, it is computationally expensive to automatically select the optimal set of custom instructions. Therefore, heuristic techniques are often employed to quickly search the ..."
Abstract
-
Cited by 66 (8 self)
Extensible processors allow addition of application-specific custom instructions to the core instruction set architecture. However, it is computationally expensive to automatically select the optimal set of custom instructions. Therefore, heuristic techniques are often employed to quickly search the design space. In this paper, we present an efficient algorithm for exact enumeration of all possible candidate instructions given the dataflow graph (DFG) corresponding to a code fragment. Even though this is similar to the “subgraph enumeration” problem (which is exponential), we find that most subgraphs are not feasible candidates for various reasons. In fact, the number of candidates is quite small compared to the size of the DFG. Compared to previous approaches, our technique achieves orders of magnitude speedup in enumerating these candidate custom instructions for very large DFGs.
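To make the feasibility idea concrete, here is a minimal brute-force sketch in Python. It is not the paper's algorithm: it scans all small subgraphs of a DFG and keeps those that are convex and meet assumed input/output port limits (the graph encoding, port limits, and size cap are all illustrative).

    from itertools import combinations

    def enumerate_candidates(nodes, edges, max_in=2, max_out=1, max_size=4):
        # Brute-force reference, NOT the paper's fast algorithm: scan all
        # small subgraphs and keep the feasible ones (limits are assumed).
        preds = {v: set() for v in nodes}
        succs = {v: set() for v in nodes}
        for u, v in edges:
            preds[v].add(u)
            succs[u].add(v)

        def is_convex(s):
            # Convex: no path may leave the subgraph and re-enter it.
            outside = set(nodes) - s
            reach = set()
            frontier = {w for v in s for w in succs[v] if w in outside}
            while frontier:
                n = frontier.pop()
                if n not in reach:
                    reach.add(n)
                    frontier |= succs[n] & outside
            return not any(preds[v] & reach for v in s)

        for k in range(1, max_size + 1):
            for combo in combinations(nodes, k):
                s = set(combo)
                ins = {u for v in s for u in preds[v]} - s   # external inputs
                outs = {v for v in s if succs[v] - s}        # visible outputs
                if len(ins) <= max_in and len(outs) <= max_out and is_convex(s):
                    yield s

The exhaustive scan is exponential in the DFG size, which is precisely the cost the paper's exact enumeration avoids.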
Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization
- In Proceedings of the International Symposium on Microarchitecture, 2004
"... Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing the subgraphs simultaneously reduces ..."
Abstract
-
Cited by 47 (0 self)
Application-specific instruction set extensions are an effective way of improving the performance of processors. Critical computation subgraphs can be accelerated by collapsing them into new instructions that are executed on specialized function units. Collapsing the subgraphs simultaneously reduces the length of computation as well as the number of intermediate results stored in the register file. The main problem with this approach is that a new processor must be generated for each application domain. While new instructions can be designed automatically, there is a substantial amount of engineering cost incurred to verify and to implement the final custom processor. In this work, we propose a strategy for transparent customization of the core computation capabilities of the processor without changing its instruction set. A configurable array of function units is added to the baseline processor that enables the acceleration of a wide range of dataflow subgraphs. To exploit the array, the microarchitecture identifies subgraphs at run time and replaces them with new microcode instructions that configure and utilize the array. We compare the effectiveness of replacing subgraphs in the fill unit of a trace cache versus using a translation table during decode, and evaluate the tradeoffs between static and dynamic identification of subgraphs for instruction set customization.
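As a loose illustration of the replacement step only, the sketch below rewrites a decoded instruction sequence, fusing known two-operation patterns into a single microcode op that would configure the array. Real identification matches dataflow subgraphs rather than adjacent instructions, and the opcodes and pattern table here are invented:

    # Assumed subgraphs the function-unit array can execute (illustrative).
    PATTERNS = {
        ("add", "shl"): "ccu_add_shl",
        ("and", "xor"): "ccu_and_xor",
    }

    def rewrite_trace(trace):
        # trace: list of (opcode, operands). Fuse matching adjacent pairs
        # into one configurable-array microcode instruction.
        out, i = [], 0
        while i < len(trace):
            pair = tuple(op for op, _ in trace[i:i + 2])
            if pair in PATTERNS:
                fused_ops = trace[i][1] + trace[i + 1][1]
                out.append((PATTERNS[pair], fused_ops))
                i += 2
            else:
                out.append(trace[i])
                i += 1
        return out

    print(rewrite_trace([("add", ["r1", "r2"]), ("shl", ["r3"]), ("ld", ["r4"])]))
    # -> [('ccu_add_shl', ['r1', 'r2', 'r3']), ('ld', ['r4'])]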
Exploring the Design Space of LUT-based Transparent Accelerators
- In Proc. of the 2005 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems
"... Instruction set customization accelerates the performance of applications by compressing the length of critical dependence paths and reducing the demands on processor resources. With instruction set customization, specialized accelerators are added to a conventional processor to atomically execute d ..."
Abstract
-
Cited by 13 (0 self)
Instruction set customization improves application performance by compressing the length of critical dependence paths and reducing the demands on processor resources. With instruction set customization, specialized accelerators are added to a conventional processor to atomically execute dataflow subgraphs. Accelerators that are exploited without explicit changes to the instruction set architecture of the processor are said to be transparent. Transparent acceleration relies on a lightweight hardware engine to dynamically generate control signals for the accelerator, using subgraphs delineated by a compiler. The design of transparent subgraph accelerators is challenging, as critical subgraphs need to be supported efficiently while maintaining area and timing constraints. Additionally, more complex accelerators require more sophisticated control generation engines. These factors must be carefully balanced. In this work, we investigate the design of subgraph accelerators using configurable lookup table structures. These designs provide an effective paradigm for executing a wide range of subgraphs involving arithmetic and logic operations. We describe why lookup table designs are effective, how they fit into a transparent acceleration framework, and evaluate the effectiveness of a wide range of designs using both simulation and logic synthesis.
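The core mechanism is easy to show in miniature: an n-input lookup table stores a 2^n-bit truth table, so a single configuration word lets it compute any n-input logic function. A bit-level sketch (word-wide units would replicate or chain such cells; all names are illustrative):

    def lut_config(func, n):
        # Pack the truth table of an n-input boolean function into one
        # integer: bit `idx` of the word is func applied to the bits of idx.
        cfg = 0
        for idx in range(2 ** n):
            bits = [(idx >> i) & 1 for i in range(n)]
            if func(*bits):
                cfg |= 1 << idx
        return cfg

    def lut_eval(cfg, *inputs):
        # Evaluate the configured LUT: the inputs form the table index.
        idx = sum(b << i for i, b in enumerate(inputs))
        return (cfg >> idx) & 1

    # A 3-input LUT configured as the sum bit of a full adder.
    cfg = lut_config(lambda a, b, c: a ^ b ^ c, 3)
    assert lut_eval(cfg, 1, 0, 0) == 1 and lut_eval(cfg, 1, 1, 0) == 0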
Satisfying real-time constraints with custom instructions
- In ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 2005
"... Instruction-set extensible processors allow an existing processor core to be extended with application-specific custom instructions. In this paper, we explore a novel application of instruction-set extensions to meet timing constraints in real-time embedded systems. In order to satisfy real-time con ..."
Abstract
-
Cited by 9 (2 self)
Instruction-set extensible processors allow an existing processor core to be extended with application-specific custom instructions. In this paper, we explore a novel application of instruction-set extensions to meet timing constraints in real-time embedded systems. In order to satisfy real-time constraints, the worst-case execution time (WCET) of a task should be reduced, as opposed to its average-case execution time. Unfortunately, existing custom instruction selection techniques based on average-case profile information may not reduce a task's WCET. We first develop an Integer Linear Programming (ILP) formulation to choose optimal instruction-set extensions for reducing the WCET. However, ILP solutions for this problem are often too expensive to compute. Therefore, we also propose an efficient and scalable heuristic that obtains results quite close to optimal. Experimental results indicate that a suitable choice of custom instructions can reduce the WCET of our benchmark programs by as much as 42% (23.5% on average).
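The gap between average-case and worst-case selection can be illustrated with a toy exhaustive search standing in for the ILP. All candidate names, areas, and savings below are invented; the point is that the objective minimizes the maximum over paths, so the CI with the best average savings is not automatically picked:

    from itertools import combinations

    # Invented candidate CIs: area cost and per-path cycle savings.
    CANDIDATES = {
        "ci_mac":  {"area": 3, "savings": {"path_a": 40, "path_b": 5}},
        "ci_sad":  {"area": 2, "savings": {"path_a": 0,  "path_b": 30}},
        "ci_clip": {"area": 1, "savings": {"path_a": 10, "path_b": 10}},
    }
    PATH_CYCLES = {"path_a": 200, "path_b": 190}   # invented path lengths

    def best_selection(area_budget):
        # Exhaustive stand-in for the ILP: minimize WCET = max over paths,
        # subject to a total area budget.
        best = (max(PATH_CYCLES.values()), ())
        names = list(CANDIDATES)
        for r in range(len(names) + 1):
            for sel in combinations(names, r):
                if sum(CANDIDATES[c]["area"] for c in sel) > area_budget:
                    continue
                wcet = max(
                    PATH_CYCLES[p] - sum(CANDIDATES[c]["savings"][p] for c in sel)
                    for p in PATH_CYCLES
                )
                best = min(best, (wcet, sel))
        return best

    # ci_clip has the smallest savings yet is chosen: it trims *both* paths.
    print(best_selection(area_budget=4))   # (175, ('ci_mac', 'ci_clip'))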
Exploiting forwarding to improve data bandwidth of instruction-set extensions
- In 43rd DAC, 2006
"... Application-specific instruction-set extensions (custom instructions) help embedded processors achieve higher performance. Most custom instructions offering significant performance benefit re-quire multiple input operands. Unfortunately, RISC-style embedded processors are designed to sup-port at mos ..."
Abstract
-
Cited by 8 (0 self)
Application-specific instruction-set extensions (custom instructions) help embedded processors achieve higher performance. Most custom instructions offering significant performance benefit require multiple input operands. Unfortunately, RISC-style embedded processors are designed to support at most two input operands per instruction. This data bandwidth problem is due to the limited number of read ports in the register file per instruction as well as the fixed-length instruction encoding. We propose to overcome this restriction by exploiting the data forwarding feature present in processor pipelines. With minimal modifications to the pipeline and the instruction encoding, along with cooperation from the compiler, we can supply up to two additional input operands per custom instruction. Experimental results indicate that our approach achieves 87–100% of the ideal performance limit for standard benchmark programs. Additionally, our scheme saves 25% energy on average by avoiding unnecessary accesses to the register file.
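A toy model of the operand-gathering idea, with all names assumed: the instruction encodes two register sources as usual, and up to two further operands are taken from the EX- and MEM-stage forwarding latches, provided the compiler schedules the producers to be in flight at that cycle.

    def issue_custom(op, regfile, rs1, rs2, ex_latch=None, mem_latch=None):
        # Gather up to four inputs: two through the register-file read
        # ports, two through the EX- and MEM-stage forwarding paths.
        operands = [regfile[rs1], regfile[rs2]]
        if ex_latch is not None:
            operands.append(ex_latch)    # value bypassed from EX
        if mem_latch is not None:
            operands.append(mem_latch)   # value bypassed from MEM
        return op(*operands)

    regfile = {1: 10, 2: 20}
    mac4 = lambda a, b, c, d: a * b + c * d      # assumed 4-input custom op
    print(issue_custom(mac4, regfile, 1, 2, ex_latch=3, mem_latch=4))  # 212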
An Efficient Framework for Dynamic Reconfiguration of Instruction-Set Customization
2007
"... We present an efficient framework for dynamic reconfiguration of application-specific instruction-set customization. A key component of this framework is an iterative algorithm for temporal and spatial partitioning of the loop kernels. Our algorithm maximizes the performance gain of an application w ..."
Abstract
-
Cited by 7 (3 self)
We present an efficient framework for dynamic reconfiguration of application-specific instruction-set customization. A key component of this framework is an iterative algorithm for temporal and spatial partitioning of the loop kernels. Our algorithm maximizes the performance gain of an application while taking into consideration the dynamic reconfiguration cost. It selects the appropriate custom instruction sets for the loops and maps them into appropriate configurations. We model the temporal partitioning problem as a k-way graph partitioning problem. A dynamic-programming-based solution is used for the spatial partitioning. Comprehensive experimental results indicate that our iterative algorithm is highly scalable while producing optimal or near-optimal (99% of optimal) performance gain.
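The reconfiguration trade-off can be sketched as a small dynamic program, under the simplifying assumptions that kernels execute in a fixed order, each kernel has a precomputed gain per configuration, and every configuration switch costs a fixed number of cycles; this mirrors the trade-off being optimized, not the paper's exact formulation:

    def schedule(gains, reconfig_cost):
        # gains[i][c] = cycles saved running kernel i under configuration c.
        # best[c] tracks the best net gain of any schedule ending in c.
        n, k = len(gains), len(gains[0])
        best = list(gains[0])
        for i in range(1, n):
            prev_max = max(best)
            best = [gains[i][c] + max(best[c],                     # stay in c
                                      prev_max - reconfig_cost)    # or switch
                    for c in range(k)]
        return max(best)

    # Three kernels, two configurations (numbers invented).
    print(schedule([[50, 10], [5, 60], [40, 8]], reconfig_cost=20))  # 110

In the example, the best schedule switches configurations twice because the middle kernel's gain outweighs the two reconfigurations.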
Exhaustive Enumeration of Legal Custom Instructions for Extensible Processors
"... Today’s customizable processors allow the designer to augment the base processor with custom accelerators. By choosing appropriate set of accelerators, designer can significantly enhance the performance and power of an application. Due to the large number of accelerator choices and their complex tra ..."
Abstract
-
Cited by 6 (2 self)
Today’s customizable processors allow the designer to augment the base processor with custom accelerators. By choosing an appropriate set of accelerators, the designer can significantly enhance the performance and power of an application. Due to the large number of accelerator choices and their complex trade-offs among reuse, gain, and area, manually deciding the optimal combination of accelerators is quite cumbersome and time-consuming. This calls for CAD tools that select the optimal combination of accelerators by thoroughly searching the entire design space. The term pattern is commonly used to represent the computation performed by a custom accelerator. In this paper, we propose an algorithm for rapidly enumerating all the legal patterns, taking into account several constraints posed by a typical micro-architecture. The proposed algorithm achieves a significant reduction in run-time by a) enumerating the patterns in increasing order of size and b) relating the characteristics of a (k+1)-node pattern with the characteristics of its k-node subgraphs. Also, in scenarios where I/O is not a bottleneck, the designer can optionally relax the I/O constraint, and our algorithm efficiently enumerates all legal I/O-unbound patterns. The experimental evidence indicates a run-time speedup of two orders of magnitude over state-of-the-art techniques.
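A minimal sketch of size-ordered enumeration, assuming an undirected adjacency map and a feasibility predicate is_legal standing in for the paper's constraint checks. For simplicity it re-tests every candidate from scratch, whereas the paper's speedup comes from reusing the characteristics of a pattern's k-node subgraphs:

    def grow_patterns(nodes, adj, is_legal, max_size):
        # Level k holds every connected k-node subgraph; legal ones are kept.
        level = {frozenset([v]) for v in nodes}
        legal = {p for p in level if is_legal(p)}
        for _ in range(max_size - 1):
            nxt = set()
            for pat in level:
                # Extend by a neighbouring node so candidates stay connected.
                for v in {w for u in pat for w in adj[u]} - pat:
                    nxt.add(pat | {v})
            legal |= {p for p in nxt if is_legal(p)}
            level = nxt
        return legal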
Custom instruction generation using temporal partitioning techniques for a reconfigurable functional unit
- In Int. Conference on Embedded and Ubiquitous Computing, 2006
"... Abstract. Extracting appropriate custom instructions is an important phase for implementing an application on an extensible processor with a reconfigurable functional unit (RFU). Custom instructions (CIs) are usually extracted from critical portions of applications. It may not be possible to meet a ..."
Abstract
-
Cited by 4 (4 self)
Extracting appropriate custom instructions is an important phase in implementing an application on an extensible processor with a reconfigurable functional unit (RFU). Custom instructions (CIs) are usually extracted from critical portions of applications. It may not be possible to meet all of the RFU constraints when CIs are generated. This paper addresses the generation of mappable CIs on an RFU. First, our proposed RFU architecture for an adaptive dynamic extensible processor is described. Then, an integrated framework for temporal partitioning and mapping is presented to partition and map the CIs onto the RFU. In this framework, two mapping-aware temporal partitioning algorithms are used to generate CIs. Temporal partitioning iterates and modifies the partitions incrementally to generate the CIs. Using this framework yields greater speedup for the extensible processor.
Automated instruction-set extension of embedded processors with application to MPEG-4 video encoding
- In Proc. ASAP, 2005
"... A recent approach to platform-based design involves the use of extensible processors, offering architecture customization possibilities. Part of the designer responsibilities is the domain-specific extension of the baseline processor to fit customer requirements. Key issues of this process are the a ..."
Abstract
-
Cited by 4 (2 self)
A recent approach to platform-based design involves the use of extensible processors, offering architecture customization possibilities. Part of the designer's responsibilities is the domain-specific extension of the baseline processor to fit customer requirements. Key issues in this process are automated application analysis and candidate instruction identification/selection for implementation as application-specific functional units (AFUs). In this paper, a design approach that encapsulates automated workload characterization and instruction generation is utilized for extending processors to efficiently support embedded application sets. The method used for instruction generation is a highly parameterized adaptation of the MaxMISO technique, which allows for fast design space exploration. It is shown that only a small number of AFUs are needed in order to support the algorithms of interest (MPEG-4 encoding kernels) and that it is possible to achieve 2× to 3.5× performance improvements, although further possibilities such as subword parallelization are not currently considered.
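For orientation, the MaxMISO idea that the paper parameterizes can be sketched as a fixed-point growth from an output node: a producer joins the subgraph only when all of its consumers are already inside, which keeps the subgraph single-output. The tiny DFG below is illustrative:

    def maxmiso(root, preds, succs):
        # Grow the maximal single-output subgraph rooted at `root`.
        sub = {root}
        changed = True
        while changed:
            changed = False
            for v in list(sub):
                for p in preds[v]:
                    if p not in sub and succs[p] <= sub:
                        sub.add(p)
                        changed = True
        return sub

    # Tiny DFG: a -> c, b -> c, c -> d, c -> e.
    preds = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": {"c"}}
    succs = {"a": {"c"}, "b": {"c"}, "c": {"d", "e"}, "d": set(), "e": set()}
    print(maxmiso("d", preds, succs))  # {'d'}: c also feeds e, so it cannot join
    succs["c"] = {"d"}; preds["e"] = set()  # if c fed only d ...
    print(maxmiso("d", preds, succs))  # ... the whole cone {'a','b','c','d'} joins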
eMIPS, A Dynamically Extensible Processor
2006
"... The eMIPS architecture can realize the performance benefits of application-specific hardware optimizations in a general-purpose, multi-user system environment using a dynamically extensible processor architecture. It allows multiple secure Extensions to load dynamically and to plug into the stages o ..."
Abstract
-
Cited by 3 (2 self)
The eMIPS architecture can realize the performance benefits of application-specific hardware optimizations in a general-purpose, multi-user system environment using a dynamically extensible processor architecture. It allows multiple secure Extensions to load dynamically and to plug into the stages of a pipelined data path, thereby extending the core instruction set of the microprocessor. Extensions can also be used to realize on-chip peripherals and, if area permits, even multiple cores. The new functionality can be exploited by patching the binaries of existing applications, without requiring any changes to the compilers. A working FPGA prototype and a flexible simulation system demonstrate speedups of 2×–3× on a set of applications that include games, real-time programs, and the SPEC2000 integer benchmarks. eMIPS is the first realized workstation based entirely on a dynamically extensible processor that is safe for general-purpose, multi-user applications. By exposing the individual stages of the data path, eMIPS allows optimizations not previously possible, including permitting safe and coherent accesses to memory from within an Extension, optimizing multi-branched blocks, and throwing precise and restartable exceptions from within an Extension.