Results 1 - 10 of 14
VEGAS: Soft Vector Processor with Scratchpad Memory
Cited by 14 (4 self)
This paper presents VEGAS, a new soft vector architecture in which the vector processor reads and writes directly to a scratchpad memory instead of a vector register file. The scratchpad memory is a more efficient storage medium than a vector register file, allowing up to 9× more data elements to fit into on-chip memory. In addition, the use of fracturable ALUs in VEGAS allows efficient processing of bytes, halfwords and words in the same processor instance, providing up to 4× the operations compared to existing fixed-width soft vector ALUs. Benchmarks show the new VEGAS architecture is 10× to 208× faster than Nios II and has 1.7× to 3.1× better area-delay product than previous vector work, achieving much higher throughput per unit area. To put this performance in perspective, VEGAS is faster than a leading-edge Intel processor at integer matrix multiply. To ease programming effort and provide full debug support, VEGAS uses a C macro API that outputs vector instructions as standard Nios II/f custom instructions.
B. Baas, “A GALS Many-Core Heterogeneous DSP Platform with Source-Synchronous On-Chip Interconnection Network,” Proc. of Int. Symp. on Networks-on-Chip, 2009
Cited by 7 (1 self)
This paper presents a many-core heterogeneous computational platform that employs a GALS-compatible circuit-switched on-chip network. The platform targets streaming DSP and embedded applications that have a high degree of task-level parallelism among computational kernels. The test chip, fabricated in 65 nm CMOS, consists of 164 small programmable cores, three dedicated-purpose accelerators and three shared memory modules. All processors are clocked by their own local oscillators, and communication is achieved through a simple yet effective source-synchronous technique that allows each interconnection link between any two processors to sustain a peak throughput of one data word per cycle. A complete 802.11a WLAN baseband receiver was implemented on this platform. It has a real-time throughput of 54 Mbps with all processors running at 594 MHz and 0.95 V, and consumes an average of 174.76 mW, with 12.18 mW (7.0%) dissipated by its interconnection links. By exploiting the GALS architecture and adjusting each processor's oscillator to a workload-based optimal clock frequency, with the chip's dual supply voltages set at 0.95 V and 0.75 V, the receiver consumes only 123.18 mW, a 29.5% power reduction. Measured power consumption on the real chip comes within 2-5% of the estimated results, showing the design to be highly reliable and efficient.
Mighty-Morphing Power-SIMD
Cited by 3 (0 self)
In modern wireless devices, two broad classes of compute-intensive applications are common: those with high amounts of data-level parallelism, such as the signal processing used in wireless baseband applications, and those with little data-level parallelism, such as encryption. Wide single-instruction multiple-data (SIMD) processors have become popular for providing high-performance yet power-efficient data engines for applications with abundant data parallelism. However, non-data-parallel applications are relegated to a low-performance scalar datapath on these data engines while the SIMD resources sit idle. To accelerate both types of applications, we propose a more flexible SIMD datapath called SIMD-Morph. In SIMD-Morph, code with data-level parallelism can be executed across the lanes in the traditional manner, but the lanes can be morphed into a feed-forward subgraph accelerator to execute scalar applications more efficiently. The morphed SIMD lanes form an accelerator that exploits both instruction-level parallelism and operation chaining to improve the performance of scalar code using the resources available in the SIMD lanes. Experimental results show a 2.6× improvement for purely non-SIMD applications and a 1.4× improvement for the non-SIMD portions of applications with data parallelism.
ENHANCING COEXISTENCE, QUALITY OF SERVICE, AND ENERGY PERFORMANCE IN DYNAMIC SPECTRUM ACCESS NETWORKS
, 2011
Coarse-Grained Reconfigurable Array Architectures
Cited by 2 (0 self)
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops and execute them more efficiently. This chapter discusses the basic principles of CGRAs and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support, and for manual fine-tuning of source code. Many embedded applications require high throughput, meaning that a large number of computations must be performed every second. At the same time, the power consumption of battery-operated devices must be minimized to increase their autonomy. In general, the performance obtained on a programmable processor for a given application can be defined as the reciprocal of the application execution time. Considering that most programs consist of a number of consecutive phases P = [1, p] with different characteristics, performance can be defined in terms of the operating frequencies fp, the instructions executed per cycle IPCp, and the instruction counts ICp of each phase, together with the time overhead tp→p+1 involved in switching between phases.
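The definition the abstract gestures at (the formula itself was cut off in this listing) can be reconstructed from the quantities it names. A plausible form, consistent with performance being the reciprocal of execution time summed over phases and phase transitions, is:

\[
\text{Performance} = \frac{1}{t_{\text{exe}}}
= \left( \sum_{i=1}^{p} \frac{IC_i}{IPC_i \cdot f_i} \;+\; \sum_{i=1}^{p-1} t_{i \rightarrow i+1} \right)^{-1}
\]

Each phase $i$ contributes $IC_i / (IPC_i \cdot f_i)$ seconds of execution, and each transition contributes its switching overhead $t_{i \rightarrow i+1}$.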
“A Low-Energy Wide SIMD Architecture with Explicit Datapath,” Journal of Signal Processing Systems, 2014
Cited by 1 (0 self)
Energy efficiency has become one of the most important topics in computing. To meet the ever-increasing demands of the mobile market, the next generation of processors will have to deliver high compute performance within an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that uses explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic-bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture achieves an average 206× speed-up over a reduced instruction set computing (RISC) processor and reduces total energy dissipation by 48.3% on average and up to 94%. Compared to the corresponding SIMD architecture with automatic bypassing, the 128-PE explicitly bypassed SIMD avoids an average of 64% of all register file accesses, and reduces total energy dissipation by 27.5% on average and up to 43.0%.
Software Defined Radio Architecture Survey for Cognitive Testbeds
- in "8th International Wireless Communications and Mobile Computing Conference (IWCMC 2012)", Limassol, Cyprus, September 2012, http://hal.inria.fr/hal-00736995
Cited by 1 (0 self)
In this paper we present a survey of existing prototypes dedicated to software defined radio. We propose a classification based on the architectural organization of the prototypes and draw some conclusions about the most promising architectures. This study should be useful for cognitive radio testbed designers who must choose among many possible computing platforms. We also introduce a new cognitive radio testbed currently under construction and explain how this study has influenced the testbed designers' choices.
WiBench: An Open Source Kernel Suite for Benchmarking Wireless Systems
The rapid growth in the number of mobile devices and the higher data-rate requirements of mobile subscribers have made wireless signal processing a key driving application of mobile computing technology. To design better mobile platforms and the supporting wireless infrastructure, it is very important for computer architects and system designers to understand and characterize the performance of existing and upcoming wireless protocols. In this paper, we present a newly developed open-source benchmark suite called WiBench. It consists of a wide range of signal processing kernels used in many mainstream standards such as 802.11, WCDMA and LTE. The kernels include FFT/IFFT, MIMO, channel estimation, channel coding, constellation mapping, etc. Each kernel is a self-contained configurable block that can be tuned to meet different system requirements. Several standard channel models are also included to study system performance, such as the bit error rate. The suite also contains an LTE uplink system as a representative example of a wireless system that can be built from these kernels. WiBench is provided in C++ to make it easier for computer architects to profile and analyze the system. We characterize the performance of WiBench to illustrate how it can guide hardware system design. Architectural analyses of each individual kernel and of the entire LTE uplink are performed, indicating the hotspots, available parallelism, and runtime performance. Finally, a MATLAB version is included for debugging purposes.
Styles—Adaptable architectures
The constant push for feature richness in mobile and embedded devices has significantly increased computational demand, yet stringent energy constraints typically remain in place. Embedding processor cores in FPGAs offers a path to customized instruction processors that can meet these performance and energy demands. Ideally, the customization process should be automated to reduce design effort and, indirectly, time to market. However, the automatic generation of custom extensions for floating-point computation remains a challenge in FPGA co-design. We propose an approach for accelerating such computation via application-specific SIMD extensions. We describe an automated co-design toolchain that generates code and application-specific platform extensions implementing SIMD instructions with a parameterizable number of vector elements. The parallelism exposed by encapsulating computation in vector instructions is matched to an adjustable pool of execution units. Experiments on actual hardware show significant performance improvements. Our framework provides an important extension to the capabilities of embedded-processor FPGAs, which traditionally handled bit, integer, and low-intensity floating-point code, enabling them to handle vectorizable floating-point computation.
Graduate Supervisory Committee:
, 2014
Stream processing has emerged as an important model of computation, especially in the multimedia and communication sub-systems of embedded System-on-Chip (SoC) architectures. The dataflow nature of streaming applications allows them to be most naturally expressed as a set of kernels iteratively operating on continuous streams of data. The kernels are computationally intensive and are mainly characterized by real-time constraints that demand high throughput and data bandwidth with limited global data reuse. Conventional architectures fail to meet these demands due to their poorly matched execution models and the overheads associated with instruction and data movement. This work presents StreamWorks, a multi-core embedded architecture for energy-efficient stream computing. The basic processing element in the StreamWorks architecture is the StreamEngine (SE), which is responsible for iteratively executing a stream kernel. SE introduces an instruction-locking mechanism that exploits the iterative nature of the kernels and enables fine-grain instruction reuse. Each instruction