Multimedia Extensions for General-Purpose Processors
- Proc. IEEE Workshop on Signal Processing Systems
, 1997
Cited by 47 (9 self)
This paper gives an overview of the multimedia instructions that have been added to the instruction set architectures of general-purpose microprocessors to accelerate media processing. Examples are MAX, MMX and VIS, the multimedia extensions for PA-RISC, ix86, and SPARC processor architectures. We describe subword parallelism, a low overhead form of SIMD parallelism, and the classes of instructions needed to support subword parallel computations efficiently. Features described include arithmetic operations with saturation, averaging, multiply alternatives, data rearrangement primitives like Permute and Mix, formatting instructions, conditional execution, and complex instructions. 1. INTRODUCTION The general-purpose information processing workload is changing to include an increasing amount of media processing. Media processing is the processing of digital multimedia information, such as images, video, 2-dimensional and 3-dimensional graphics, animations, audio and text. The definition...
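The saturating and averaging operations named in this abstract can be sketched in portable C. This is only an illustrative model, assuming eight unsigned 8-bit subwords packed into one 64-bit word; the function names padd_u8_sat and pavg_u8 are hypothetical and do not correspond to any MAX, MMX or VIS mnemonic, and the averaging shown rounds down, which may differ from the rounding a given extension uses.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model: eight unsigned 8-bit subwords packed in one 64-bit word. */

/* Per-lane unsigned saturating add: a lane that would exceed 255 clamps to 255
 * instead of wrapping around. Written as a lane loop to show the semantics. */
uint64_t padd_u8_sat(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (unsigned)((a >> (8 * lane)) & 0xFF);
        unsigned y = (unsigned)((b >> (8 * lane)) & 0xFF);
        unsigned sum = x + y;
        if (sum > 0xFF)
            sum = 0xFF;                      /* saturate */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}

/* Per-lane average, rounded down, computed on all eight lanes at once.
 * Uses x + y = 2*(x & y) + (x ^ y); clearing the low bit of every byte
 * before the shift keeps bits from leaking into the neighbouring lane. */
uint64_t pavg_u8(uint64_t a, uint64_t b)
{
    const uint64_t clear_low_bits = 0xFEFEFEFEFEFEFEFEULL;
    return (a & b) + (((a ^ b) & clear_low_bits) >> 1);
}
```

The second function is the more interesting one: it processes every lane with three ordinary 64-bit operations, which is the low-overhead form of SIMD that the abstract calls subword parallelism.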
Exploiting cache in multimedia
- In International Conference on Computing and Systems
, 1999
Mapping of Application Software to the Multimedia Instructions of General-Purpose Microprocessors
, 1997
Cited by 6 (3 self)
This paper describes how media processing programs may be accelerated by using the multimedia instruction extensions that have been added to general-purpose microprocessors. As a concrete example, it describes MAX2, a minimalist, second-generation set of multimedia instructions included in the PA-RISC 2.0 processor architecture. MAX2 implements subword parallel instructions, which utilize the microprocessor's 64-bit wide datapaths to process multiple pieces of lower-precision data in parallel. It also includes innovative, new instructions like Mix, which are very useful for matrix transpose and other common data rearrangements. The paper examines some typical multimedia kernels, like Block Match, Matrix Transpose, Box Filter and the IDCT, coded with and without the MAX2 instructions, to illustrate programming techniques for exploiting subword parallelism and superscalar instruction parallelism. The kernels using MAX2 show significant speedups in execution time, and more efficient ut...
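To make the role of a Mix-style instruction in matrix transpose more concrete, here is a small C model. It is a sketch under stated assumptions, not the MAX2 definition: it assumes a 64-bit register holding four 16-bit subwords numbered 0 to 3 from the most significant end, and it assumes that a "left" mix interleaves the even-numbered subwords of two registers while a "right" mix interleaves the odd-numbered ones. The helper names subword, pack4, mix_left and mix_right are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit subword i of w, where subword 0 is the most significant. */
static uint64_t subword(uint64_t w, int i)
{
    assert(i >= 0 && i < 4);
    return (w >> (16 * (3 - i))) & 0xFFFF;
}

/* Pack four 16-bit values into one 64-bit word, s0 most significant. */
static uint64_t pack4(uint64_t s0, uint64_t s1, uint64_t s2, uint64_t s3)
{
    return (s0 << 48) | (s1 << 32) | (s2 << 16) | s3;
}

/* Assumed "mix left": [a0 a1 a2 a3], [b0 b1 b2 b3] -> [a0 b0 a2 b2]. */
uint64_t mix_left(uint64_t a, uint64_t b)
{
    return pack4(subword(a, 0), subword(b, 0), subword(a, 2), subword(b, 2));
}

/* Assumed "mix right": [a0 a1 a2 a3], [b0 b1 b2 b3] -> [a1 b1 a3 b3]. */
uint64_t mix_right(uint64_t a, uint64_t b)
{
    return pack4(subword(a, 1), subword(b, 1), subword(a, 3), subword(b, 3));
}
```

If a and b are two adjacent rows of a matrix of 16-bit elements, mix_left and mix_right together transpose each 2x2 block of those rows, which is why an interleaving primitive of this kind is a natural building block for a full transpose.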
CoMPARE: A Simple Reconfigurable Processor Architecture Exploiting Instruction Level Parallelism
- In Proceedings of the 5th Australasian Conference on Parallel and Real-Time Systems
, 1998
Architectural Support For Long Integer Modulo Arithmetic on RISC-Based Smart Cards
- International Journal of High Performance Computing Applications
, 2003
Cited by 2 (0 self)
Various algorithms for public-key cryptography, such as the Rivest–Shamir–Adleman or Diffie–Hellman algorithms, are based on long integer arithmetic operations, most notably modulo multiplication. To be adequate for long-term security, the modulus should have a length of at least 1024 bits. Long integer arithmetic is difficult to implement efficiently in software, particularly on smart cards due to their constrained resources and relatively slow clock frequency. In this paper we investigate the potential of application-specific instruction set extensions for cryptographic workloads such as long integer arithmetic. We define two special instructions that carry out computations of the form a×b+c+d, whereby a, b, c, d are single-precision words (unsigned integers). These additional instructions can be executed on an optimized multiply/accumulate unit and therefore they are simple to incorporate into common RISC architectures such as the MIPS32. The proposed extensions cause almost no speed or area penalty since no extra functional units are required. Experimental results indicate that the inner-loop operation of a multiple-precision multiplication can be accelerated by a factor of almost 2. We also estimate the execution time of a 1024-bit modulo exponentiation assuming that these special instructions were made available. The presented concept is an alternative solution to a crypto co-processor, especially for multi-application smart cards (e.g. Java cards) with an embedded 32-bit RISC core.
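The a×b+c+d form fits the inner loop of multiple-precision multiplication because the result can never overflow a double-precision word: with 32-bit words, (2^32-1)^2 + 2(2^32-1) = 2^64-1. The C sketch below models that operation in software, assuming 32-bit words stored least significant first; the names mac_abcd and mul_add_row are hypothetical and are not the instructions defined in the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a*b + c + d on 32-bit words. The 64-bit result cannot
 * overflow: the maximum is (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
uint64_t mac_abcd(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (uint64_t)a * b + c + d;
}

/* Inner loop of a multiple-precision multiplication: r += a * b, where a and
 * r are n-word integers (least significant word first) and b is one word.
 * Returns the final carry word. Each iteration is one a*b+c+d operation,
 * with c the accumulator word and d the carry from the previous iteration. */
uint32_t mul_add_row(uint32_t *r, const uint32_t *a, size_t n, uint32_t b)
{
    assert(n == 0 || (r != NULL && a != NULL));
    uint32_t carry = 0;
    for (size_t j = 0; j < n; j++) {
        uint64_t t = mac_abcd(a[j], b, r[j], carry);
        r[j] = (uint32_t)t;              /* low word stays in place */
        carry = (uint32_t)(t >> 32);     /* high word propagates */
    }
    return carry;
}
```

On a plain RISC core the body of this loop typically costs a multiply plus several add-with-carry steps, so folding both additions into the multiply is a plausible source of the speedup the abstract reports.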
A Compiler Target Model for Line Associative Registers
Cited by 2 (0 self)
Line Associative Registers (LARs) are the basis for a new class of processor architectures in which memory accesses are minimized by explicitly managing wide lines of instructions and data in processor registers. The design of LARs has significant commonality with a number of existing technologies which have been more or less widely adopted; however, we firmly believe that LARs-based design, which will employ a highly unusual execution model discussed in the remainder of this paper, has ever-increasing potential for performance gains over conventional designs utilizing hierarchies of caches and registers. In order to effectively test and utilize this new design, suitable development tools must be written. This paper attempts to describe the implications of a LARs-based architecture for compiler writers, and to demonstrate that the benefits of such a design can be harnessed with the use of conventional programming languages. At this time, an HDL verification model implementing a simple LARs-based architecture has been completed, and work is underway on a set of software development tools based on the LLVM compiler infrastructure.
A Study of the Performance Potential for Dynamic Instruction Hints Selection
Cited by 1 (0 self)
Instruction hints have become an important way to communicate compile-time information to the hardware. They can be generated by the compiler and the post-link optimizer to reduce cache misses, improve branch prediction and minimize other performance bottlenecks. This paper discusses different instruction hints available on modern processor architectures and shows the potential performance impact on many benchmark programs. Some hints can be effectively selected at compile time with profile feedback. However, since the same program executable can behave differently on various inputs and performance bottlenecks may change on different micro-architectures, significant performance opportunities can be exploited by selecting instruction hints dynamically.
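For readers unfamiliar with instruction hints, the following C sketch shows what statically chosen hints can look like at source level. It assumes a GCC or Clang toolchain, whose __builtin_prefetch and __builtin_expect builtins let the programmer request a data prefetch and state the expected outcome of a branch; the function name sum_nonnegative and the prefetch distance of 16 elements are hypothetical choices for illustration. It does not reproduce the dynamic selection mechanism studied in the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed tuning value; a good distance depends on the micro-architecture. */
#define PREFETCH_DISTANCE 16

/* Sum the non-negative elements of v, with two static hints:
 *  - a prefetch hint for data that will be read a few iterations later;
 *  - a branch hint saying negative elements are expected to be rare. */
int64_t sum_nonnegative(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&v[i + PREFETCH_DISTANCE], 0, 1); /* read, low temporal locality */
        if (__builtin_expect(v[i] < 0, 0))   /* hint: this branch is unlikely */
            continue;
        total += v[i];
    }
    return total;
}
```

Both hints are fixed when the program is compiled. If the input turns out to be mostly negative, or the target machine prefers a different prefetch distance, the hints become counterproductive, which is the kind of situation that motivates selecting hints at run time.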
Design and Implementation of the Instruction Set Architecture for DATA LARs
, 2010
The ideal memory system assumed by most programmers is one which has high capacity, yet allows any word to be accessed instantaneously. To make the hardware approximate this performance, an increasingly complex memory hierarchy, using caches and techniques like automatic prefetch, has evolved. However, as the gap between processor and memory speeds continues to widen, these programmer-visible mechanisms are becoming inadequate. Part of the recent increase in processor performance has been due to the introduction of programmer/compiler-visible SWAR (SIMD Within A Register) parallel processing on increasingly wide DATA LARs (Line Associative Registers) as a way to both improve data access speed and increase efficiency of SWAR processing. Although the base concept of DATA LARs predates this thesis, this thesis presents the first instruction set architecture specification complete enough to allow construction of a detailed prototype hardware design. This design was implemented and tested using a hardware simulator.