23 citations found.
Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Conference on Programming Language Design and Implementation, pages 145--156, June 2000.

This paper is cited in the following contexts:
The Architecture of the DIVA Processing-in-Memory Chip - Draper, Chame, Hall.. (2002)   (3 citations)

....real time embedded operating system. We have developed a prototype compiler for the DIVA PIMs, which takes as input sequential Fortran or C code, and produces DIVA executables that exploit both the scalar and WideWord unit. We leverage the SUIF compiler, including extensions described in [18] and our own implementation of transformations described in [6] and a GCC backend for the PowerPC AltiVec. A system level compiler is an area of future work. 6. RELATED WORK The DIVA system architecture is focused on achieving the following four goals: 1) developing PIMs that can serve as the ....

....the pipeline and slow down the clock rate. All other implementations have separate scalar and WideWord units and register files, and other than DIVA, only SSE2 includes transfers between register files. The absence of such capability was reported to be a performance bottleneck in the AltiVec [18]. AltiVec and ASAP support only general permutations, where permutation vectors are read from memory or constructed by instructions. Both SSE2 and DIVA can avoid these costs of deriving a permutation vector through hardwired permutation operations. In the case of SSE2, permutation operations can ....

S. Larsen and S. Amarasinghe. Exploiting superword-level parallelism with multimedia instruction sets. In Proceedings of the ACM Conference on Programming Languages Design and Implementation, 2000.


Bitwidth Aware Global Register Allocation - Tallam, Gupta (2002)

....bit section referencing in context of variables that already contained packed data. They do not carry out any additional variable packing as described in this paper. Some multimedia instruction sets support long registers which can hold multiple words of data for carrying out SIMD operations [5, 11]. Compiler techniques allocate array sections to these registers. In contrast, our work is aimed at shrinking scalars to subword entities and packing them into registers which are one word long. The scalar variables that we handle are ignored by superword techniques. Finally in context of embedded ....

S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 145--156, Vancouver B.C., Canada, June 2000.


Current Research Efforts in Media ISA Development - Lappalainen, Liuha, Hämäläinen

.... aggregate data elements are of the same size as the machine word (i.e. of processor's native word length) then individual bit fields are referred to as sub words [16]. On the other hand, if the aggregate data elements are larger than the size of a machine word, then they are called super words [15]. The corresponding terms sub word level parallelism and super word level parallelism are commonly used. HP's MAX-1 extension was the pioneer. It was later accompanied by Sun's VIS, HP's MAX-2, DEC's MVI, Intel's MMX, MIPS MDMX, Motorola's Altivec, Intel's SSE and SSE2, and AMD's 3DNow!, for ....

....registers to eliminate unnecessary memory accesses. They build on the work of Larsen and Amarasinghe by evaluating the effectiveness of these super word level locality optimisations through an implementation integrated with the compiler algorithm for exploiting super word level parallelism [15]. 6. IMPROVED DATA MANAGEMENT TECHNIQUES Slingerland and Smith observe that because SIMD architectures apply the same operation to all sub words in a packed register, there are many cases where data is not optimally organized after loading from memory. Thus, they propose strided load and store ....

[Article contains additional citation context not shown here]

S. Larsen and S. Amarasinghe, "Exploiting superword level parallelism with multimedia instruction sets," Proc. of the ACM SIGPLAN '00 Conf. on Programming Language Design and Implementation, pp. 145-156, Jun. 2000.


Bit Section Instruction Set Extension of ARM for Embedded.. - Li, Gupta (2002)

....reduced by 12 to 50 for functions that can take advantage of BSX instructions. 5. RELATED WORK A wide variety of instruction set support has been developed to support multimedia and network processing applications. Most of these extensions have to do with exploiting subword [5] and superword [11] parallelism. The instruction set extensions proposed by Yang and Lee [22] focus on permuting subword data that is packed together in registers. The network processor described [15] also supports bit section referencing. In this paper we carefully designed an extension consisting of a small subset ....

....and cast.encoder [20]. Their work exploits bit section referencing in context of variables that already contain packed data. They do not carry out any additional variable packing. Compiler techniques for carrying out SIMD operations on narrow width data packed in registers can be found in [4, 11]. 6. CONCLUSIONS We presented the design of the Bit Section eXtension (BSX) to the ARM processor which can be easily encoded into the free encoding space of the ARM instruction set. We found that bit sections are frequently manipulated by multimedia and network data processing codes. Therefore ....

S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 145--156, Vancouver B.C., Canada, June 2000.


Energy aware Compilation for DSPs with SIMD instructions - Lorenz, Wehmeyer, Dräger (2002)

....However, there is a need for very complex techniques for analyzing the source program. In addition, the inserted assembly code instructions have a great impact on the code generation phase which is performed in a subsequent step. Vectorization on basic block level Due to the complex analysis, [15, 12] proposed to perform the vectorization on basic block level. Here, the parallelism is increased by unrolling a loop times. The loop unrolling factor can be determined for instance by the number of parallel data paths of the processor. After that, instructions which can be executed as SIMD ....

S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Vancouver, Canada, 2000.
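
The basic-block vectorization described in the excerpt above (unroll the loop by the number of parallel data paths, then merge the unrolled copies into SIMD instructions) can be illustrated with a minimal sketch. This is not code from any of the cited papers: the function names, the four-lane width, and the use of Python lists as stand-ins for SIMD registers are all assumptions made for illustration.

```python
def add_scalar(a, b):
    """Reference loop: one element per iteration."""
    return [a[i] + b[i] for i in range(len(a))]


def simd_add(xs, ys):
    """Hypothetical stand-in for one SIMD add: all lanes in a single step."""
    return [x + y for x, y in zip(xs, ys)]


def add_unrolled_packed(a, b, lanes=4):
    """Unroll the loop by `lanes`, then pack each group of unrolled
    statements into one wide operation. `lanes` is an assumed value
    standing for the number of parallel data paths."""
    n = len(a)
    out = [0] * n
    main = n - n % lanes  # part of the loop that fills whole groups
    for i in range(0, main, lanes):
        # The `lanes` isomorphic scalar adds become one packed add.
        out[i:i + lanes] = simd_add(a[i:i + lanes], b[i:i + lanes])
    for i in range(main, n):
        # Leftover iterations stay scalar.
        out[i] = a[i] + b[i]
    return out
```

Both versions compute the same result; the packed version issues one wide operation per group of `lanes` elements and handles any remainder with scalar code.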


Cool-Fetch: A Compiler-Enabled IPC Estimation Based.. - Unsal, Koren..

....into a long instruction. The literature for VLIW compilers is vast, for the sake of brevity we refer the reader to Schlansker et al. [33] for compiler architecture interaction techniques which achieve high levels of ILP in VLIW processors. IPC estimation is similar to superword level parallelism [20] in the sense that it can be profitable when inherent ILP is scarce. Energy reduction through ILP monitoring is a fertile research area. Most approaches use hardware based heuristics to predict ILP behavior based on past profiling information. This dynamic only estimation is then used to drive a ....

Larsen S., Amarasinghe S., "Exploiting Superword Level Parallelism With Multimedia Instruction Sets," In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, June 2000.


Data Compression Transformations for Dynamically Allocated.. - Zhang, Gupta (2002)   (5 citations)

....heuristics are sufficient to determine that the data is likely to be compressible. ISA extensions have been developed to efficiently process narrow width data including Intel's MMX [9] and Motorola's AltiVec [11]. Compiler techniques are also being developed to exploit such instruction sets [8]. However, the instructions we require are quite different from MMX instructions because we must handle partially compressible data structures and we must also handle pointer data. 7 Conclusions In conclusion we have introduced a new class of transformations that apply data compression techniques ....

S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pages 145-156, Vancouver B.C., Canada, June 2000.


Bitwise: Optimizing Bitwidths Using Data-Range Propagation - Stephenson (2000)   (2 citations)

....
Benchmark   Before   After   Dynamic lower bound
intfir      128      79      68
newlife     192      62      60
parity      128      29      29
pmatch      128      30      21
sor         96       29      28
Table 6.2: The actual number of bits in the program before and after bitwidth analysis. The dynamic lower bound which was obtained by runtime profiling is included for reference.

....higher degrees of parallelism [14]. In this context, the spectrum shows which applications will have the best prospect for packing values into sub word instructions.

[Figure: bar chart, axis 0 to 100, over benchmarks softfloat, adpcm, bubblesort, life, intmatmul, jacobi, median, mpegcorr, sha, bilinterp, convolve, histogram, intfir, newlife ....]

Samuel Larsen and Saman Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Vancouver, BC, June 2000.


C Compiler Design for a Network Processor - Wagner, Leupers (2001)   (1 citation)

....intended to meet the high code quality demands of embedded systems, have already been developed. These include code generation for irregular data paths [3], [4], [5], [6], [7], address code optimization for DSPs [8], [9], [10], [11], and exploitation of multimedia instruction sets [12], [13], [14]. It has been shown experimentally, that such highly machine-specific techniques are a promising approach to generate high quality machine code, whose quality often comes close to hand written assembly code. Naturally, this has to be paid with increased compilation times in many cases. While ....

S. Larsen, S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2000.


Width-Sensitive Scheduling for Resource-Constrained VLIW.. - Nakra, Childers, Soffa (2000)   (1 citation)

....support up to 64 bits of data, the applications running on these processors rarely require the entire data width. There has been some work with compiler optimization and computer architecture to exploit data width by packing several operations together to execute on a single functional unit (FU) [2, 4]. Most of this work has focused on superscalar processors utilizing sub word parallelism to pack similar (homogeneous) operations with narrow operands. This paper evaluates packing multiple narrow operands on resource constrained VLIW processors. In particular, we present a static technique, ....

....in hardware and compiler support for synthesizing SIMD like instructions without user intervention. Brooks et al. [2] proposed an architecture that dynamically packs narrow integer operations on an FU, similar to a parallel sub word operation. Compiler support has been proposed by Larsen et al. [4] to synthesize SIMD instructions from basic block statements. Both of these recent studies are oriented toward superscalar processors. In this paper, we exploit smaller operand widths for VLIW processors. One of the observations in previous studies is that multimedia benchmarks benefit greatly ....

[Article contains additional citation context not shown here]

Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 145--156, Vancouver, BC, June 18--21, 2000.


Increasing and Detecting Memory Address Congruence - Larsen, Witchel, Amarasinghe (2002)   (4 citations)  Self-citation (Larsen Amarasinghe)

.... offered in the Pentium II and Pentium III require six to nine extra cycles if the data cross a cache line boundary [11]. In previous work, we presented a compiler algorithm that automatically extracts SIMD parallelism from sequential programs without using complicated vectorization techniques [13]. Recently, this effort has been extended by Shin et al. [16]. One of the main objectives of our approach is to combine multiple sequential memory references into a single wide operation. Since congruence information specifies the cache line locations of memory references, it is used to ensure ....

....on various datatypes packed into a 128 bit superword. Therefore, the effective vector length depends on the size of the elements. Ideally, we would like to compare parallelization in the presence of congruence information to parallelization when it is absent. The SIMD compiler we presented in [13] is completely dependent on congruence information to achieve parallelization. As a result, it is impossible to isolate its contribution to the speedups we observe. Instead, we have chosen to use a commercial vectorizer for this study. The VAST compiler [1] can still provide performance gains in ....

S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 145--156, Vancouver, BC, June 2000.
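
The role of congruence in the excerpts above can be made concrete with a small sketch: knowing an address modulo the line size is enough to tell whether a wide access stays inside one cache line. This is an illustration only, not the authors' analysis. The function names are hypothetical, and the 32-byte line size and 16-byte (128-bit) access width are assumed values.

```python
def is_aligned(addr, width=16):
    """True if a `width`-byte access at `addr` falls on a natural boundary."""
    return addr % width == 0


def crosses_line(addr, size=16, line_bytes=32):
    """True if the bytes addr .. addr+size-1 span two cache lines."""
    return addr // line_bytes != (addr + size - 1) // line_bytes


def crosses_line_from_congruence(residue, size=16, line_bytes=32):
    """Same test using only the congruence class addr mod line_bytes,
    which is the kind of fact a compiler can know without the full address."""
    return residue + size > line_bytes
```

For example, with these assumed sizes a 16-byte access at address 24 spans two lines, while one at address 16 does not, and the congruence-only test gives the same answers for residues 24 and 16.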


Techniques for Increasing and Detecting Memory Alignment - Larsen, Witchel, Amarasinghe (2001)   Self-citation (Larsen Amarasinghe)

.... offered in the Pentium II and Pentium III require six to nine extra cycles if the data cross a cache line boundary [11]. In previous work, we presented a compiler algorithm that automatically extracts SIMD parallelism from sequential programs without using complicated vectorization techniques [13]. We use alignment information to ensure that all wide memory operations fall on a natural boundary. In our approach, alignment information greatly simplifies the parallelization algorithm. 2.2 Compilation for Banked Memory Architectures Global wire delay will soon become a significant problem for ....

....instructions operate on various datatypes packed into a 128 bit superword. Therefore, the effective vector length depends on the size of the elements. Ideally, we would like to compare parallelization when alignment is known to parallelization when it is unknown. The SIMD compiler we presented in [13] is completely dependent on alignment information to achieve parallelization. As a result, it has not been optimized to generate efficient code for unaligned loads and stores. In an attempt to isolate the effects of alignment, we have used a commercial vector .... [Table fragment: columns O0, O1, O2, O3; row float (32 bit): 1.76, 1.64 ....]

S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, pages 145-156, June 2000.


Bitwidth Analysis with Application to Silicon Compilation - Stephenson, Babb.. (2000)   (40 citations)  Self-citation (Amarasinghe)

....register bits that can be saved. As we will see in the next sections, reducing register bits results in smaller datapaths and subsequently smaller, faster, and more efficient circuits. Compilers for multimedia extensions can utilize bitwidth information to extract higher degrees of parallelism [16]. In this context, the spectrum shows which applications will .... [Figure: bar chart, axis 0 to 100, over benchmarks softfloat, adpcm, bubblesort, life, intmatmul, jacobi, median, mpegcorr, sha, bilinterp, convolve, histogram, intfir, newlife, parity, pmatch, sor; series: with Bitwise, dynamic profile] Figure 12: Percentage of total ....

S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Vancouver, BC, June 2000.


Evaluating Compiler Technology for Control-Flow.. - Shin, Hall, Chame

No context found.

Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Conference on Programming Language Design and Implementation, pages 145--156, June 2000.


Compiler based Exploration of DSP Energy Savings by .. - Lorenz, Marwedel.. (2004)

No context found.

S. Larsen and S. Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proc. of PLDI, 2000.


Efficient filtering with the Co-Vector Processor - Dang Nur Engin

No context found.

Samuel Larsen and Saman Amarasinghe, Exploiting Superword Level Parallelism with Multimedia Instruction Sets, Proceedings of the ACM SIGPLAN '00 Conference on Programming Language Design and Implementation, 2000, pp. 145--156.


Exploiting Superword-Level Locality in Multimedia Extension.. - Shin, Chame, Hall (2003)

No context found.

S. Larsen and S. Amarasinghe, "Exploiting superword level parallelism with multimedia instruction sets," in Conference on Programming Language Design and Implementation, (Vancouver, BC Canada), pp. 145--156, June 2000.


Cool-Fetch: A Compiler-Enabled IPC Estimation Based.. - Unsal, Koren..

No context found.

Larsen S., Amarasinghe S., "Exploiting Superword Level Parallelism With Multimedia Instruction Sets," In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, June 2000.


A Preliminary Study on the Vectorization of Multimedia.. - Ren, Wu, Padua (2003)

No context found.

Samuel Larsen and Saman Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Vancouver, B.C., June 2000.


Speculative Software Management of Datapath-width.. - Pokam, Rochecouste, .. (2004)

No context found.

Larsen, S., and Amarasinghe, S. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2000.


A Retargetable Preprocessor for Multimedia Instructions - Gilles Pokam Julien (2001)   (1 citation)

No context found.

Samuel Larsen and Saman Amarasinghe. Exploiting Superword Level Parallelism with Multimedia Instruction Sets. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, Vancouver, BC, June 2000.


Compiler-Controlled Caching in Superword Register Files for.. - Shin, Chame, Hall (2002)   (2 citations)

No context found.

S. Larsen and S. Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Conference on Programming Language Design and Implementation, pages 145--156, Vancouver, BC Canada, June 2000.


A Representation for Bit Section Based Analysis and.. - Gupta, Mehofer, Zhang (2002)

No context found.

S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multimedia Instruction Sets," ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pages 145-156, Vancouver B.C., Canada, June 2000.
