32 citations found.
G. Beck, D. Yen, and T. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7(1/2):143--180, 1993.

This paper is cited in the following contexts:

First 50 documents

Reducing the Impact of Spill Code - Harvey   (Correct)

....much less time than a reference to an element not in the cache. A reference that hits the cache typically completes in a single cycle, while a reference that misses takes five to ten cycles on a uniprocessor machine, and as long as hundreds of cycles in a distributed memory multiprocessor [28, 36, 32, 2, 4]. This difference in access time has a strong impact on the performance of individual programs. Accordingly, much recent research in compilation has been directed at techniques that improve the likelihood of references hitting in the cache. Most of this work falls into two major categories. ....

Gary R. Beck, David W.L. Yen, and Thomas L. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7, 1993.


Compiler-Controlled Memory - Cooper, Harvey (1998)   (17 citations)  (Correct)

....less time than a reference to an element not in the cache. A reference that hits the cache typically completes in a single cycle, while a reference that misses takes five to ten cycles on a simple uniprocessor machine, and as long as hundreds of cycles in a distributed memory multiprocessor [14, 15, 22, 17, 1, 2]. This difference in access time has a strong impact on the performance of individual programs. Accordingly, much recent research in compilation has been directed at techniques that improve the likelihood of references hitting in the cache. Most of this work falls into two major categories. ....

Gary R. Beck, David W.L. Yen, and Thomas L. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7, 1993.


Height Reduction of Control Recurrences for ILP Processors - Michael Schlansker Vinod (1994)   (22 citations)  (Correct)

....remarks. 2 Overview of basic concepts and notations. This section describes some of the basic concepts and notations that are used throughout this paper. 2.1 Predicated execution. Predicated execution has been implemented in the Cydra 5 computer and has been described in a number of papers [17 20, 22]. Predicated execution refers to the conditional execution of an operation based on a boolean-valued operand, called a predicate. The operation executes if the predicate input is true and is nullified if it is false. Compare operations are used to calculate a boolean value which is subsequently ....

G.R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: architecture and implementation. The Journal of Supercomputing 7, 1/2 (1993), 143-180.
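
Editorial sketch (not taken from the citing paper): the excerpt above describes predicated execution as running an operation only when its predicate operand is true, and nullifying it otherwise. A minimal Python model of that behaviour follows. The register names, the dictionary-based register files, and the helper names `compare_lt` and `predicated_add` are all illustrative assumptions, not the Cydra 5 instruction set.

```python
# Hypothetical toy model of predicated execution; names are illustrative only.

def compare_lt(regs, preds, pdest, src1, src2):
    """Compare operation: write a boolean result into a predicate register."""
    preds[pdest] = regs[src1] < regs[src2]


def predicated_add(regs, preds, dest, src1, src2, pred):
    """Perform dest = src1 + src2 only if predicate register `pred` is true."""
    if preds[pred]:
        regs[dest] = regs[src1] + regs[src2]
    # If the predicate is false the operation is nullified:
    # no register is written.


regs = {"r1": 3, "r2": 7, "r3": 0, "r4": 0}
preds = {"p1": False, "p2": False}

compare_lt(regs, preds, "p1", "r1", "r2")            # p1 = (3 < 7)
compare_lt(regs, preds, "p2", "r2", "r1")            # p2 = (7 < 3)
predicated_add(regs, preds, "r3", "r1", "r2", "p1")  # executes
predicated_add(regs, preds, "r4", "r1", "r2", "p2")  # nullified
```

Both additions are issued unconditionally; only the predicate decides whether the result is committed, which is what lets a compiler replace a branch with straight-line code.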


Dynamically Scheduled VLIW Processors - Rau (1993)   (26 citations)  (Correct)

....original attraction of this style of architecture is its ability to exploit large amounts of instruction level parallelism (ILP) with relatively simple and inexpensive control hardware. Whereas a number of VLIW products have been built which are capable of issuing six or more operations per cycle [4, 5, 3], it has just not proven feasible to build superscalar products with this level of ILP [18, 2, 14, 8, 7, 6]. Furthermore, the complete exposure to the compiler of the available hardware resources and the exact operation latencies permits highly optimized schedules. These very same properties have ....

....it must be possible to back up instruction issue to an earlier point and then resume execution from there correctly. Since we are not addressing precise interrupts, we shall also ignore the topic of speculative execution. Lastly, we shall simplify our discussion by ignoring predicated execution [13, 3]. Predicated execution poses some difficult problems for out-of-order execution which are unrelated to whether the architecture in question is VLIW. In Section 2 we review dynamic scheduling and out-of-order execution for UAL programs. In Section 3 we examine the manner in which the semantics of ....

[Article contains additional citation context not shown here]

Beck, G.R., Yen, D.W.L., and Anderson, T.L. The Cydra 5 mini-supercomputer: architecture and implementation. The Journal of Supercomputing 7, 1/2 (May 1993), 143-180.


Register Allocation for Predicated Pipelining - Using Spiral Graph (2002)   (Correct)

....processors, where the latency gap increases between the operations and memory references. [Footnote: H. Itoga is now with Ibaraki Industrial Technology Center, Ibaraki, Japan.] Register allocation is also more complex since some modern processors have special hardware facilities, such as register renaming [1]. Software pipelining [2] is an optimization method for loop intensive programs using Instruction Level Parallelism (ILP). It schedules the instructions in the iterations in order to overlap partially on the compilation time. Two software pipelining problems have been solved by hardware support: ....

Gary R. Beck, David W. L. Yen, and Thomas L. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. In The Journal of Supercomputing, volume 7 (1--2), pages 143--180, May 1993.


Evaluating the Use of Register Queues in Software Pipelined Loops - Tyson (2001)   (Correct)

....as well as software hardware techniques to ease the integration of RQs into existing instruction set architectures and machine implementations with out of order pipelines. In the context of this research, register queues can most clearly be viewed as a combination of the rotating register file ([2], [22]) and register connection [14] concepts. This combination enables a decoupling of the total register space for SP into a small set of architected registers and a large set of physical registers that are organized as circular buffers and accessed indirectly. By using register queues, the ....

....paper, we concentrate on modulo scheduling, while recognizing that our results can be applied to other scheduling algorithms as well. Rau [22] addressed the naming problem in software pipelined loops by employing a new method of addressing a processor register file in the Cydra 5 minisupercomputer [2]. The Rotating Register File (RR) is a register file that supports compiler managed hardware renaming by adding the register address (specified in the instruction) to the contents of an Iteration Control Pointer (ICP) (modulo the number of registers in the RR). This register specifier is then used to ....

G. Beck, D. Yen, and T. Anderson. The Cydra-5 minisupercomputer: Architecture and implementation. Journal of Supercomputing, 7(1):143-180, May 1993.
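
Editorial sketch (not taken from the citing paper): the second excerpt above says the Rotating Register File forms the physical register index by adding the register address in the instruction to the Iteration Control Pointer (ICP), modulo the number of registers. The Python model below shows only that address computation. The class and method names are hypothetical, and the choice to decrement the ICP once per loop iteration is an assumption made so that a value written as register k in one iteration is read as register k+1 in the next; the excerpt does not state the direction of rotation.

```python
# Hypothetical model of rotating-register addressing; names are illustrative.

class RotatingRegisterFile:
    def __init__(self, size):
        self.regs = [0] * size
        self.icp = 0  # Iteration Control Pointer

    def _physical(self, specifier):
        # Physical index = (register address in instruction + ICP) mod size.
        return (specifier + self.icp) % len(self.regs)

    def read(self, specifier):
        return self.regs[self._physical(specifier)]

    def write(self, specifier, value):
        self.regs[self._physical(specifier)] = value

    def next_iteration(self):
        # Assumption: the ICP is decremented at each new loop iteration.
        self.icp = (self.icp - 1) % len(self.regs)


rrf = RotatingRegisterFile(8)
rrf.write(2, 42)           # iteration i writes "register 2"
rrf.next_iteration()
value_next = rrf.read(3)   # iteration i+1 sees the same value as "register 3"
```

Because the renaming happens in the address computation, the loop body can use the same register specifiers in every iteration without overwriting values that are still live from earlier iterations.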


Modulo Scheduling, Machine Representations, and.. - Eichenberger (1997)   (Correct)

....compilers spend a significant amount of compilation time scheduling operations, and thus testing for potential resource contentions. When a benchmark suite of 1327 loops from the Perfect Club [13], SPEC 89 [91], and the Livermore Fortran Kernels [65] is scheduled for the Cydra 5 machine [11][27], approximately 50% of the total time is spent modeling the resources (i.e. answering queries such as "can this operation be scheduled in this cycle?"); the other 50% of the total time is spent scheduling operations (i.e. deciding the order in which operations are scheduled, initiating the ....

....The major advantage of this approach over previous work [9], [68], [76] is that no restrictions are imposed on scheduling algorithms other than the need to satisfy the constraints of the machine itself. Reduced representations for the DEC Alpha 21064 [28], MIPS R3000/R3010 [53], and Cydra 5 [11] indicate potentially 4.0 to 6.9 times faster contention queries, while requiring 22% to 67% of the memory storage used by the original machine descriptions. Dynamic measurements obtained when scheduling 1327 loops of the benchmark suite for the Cydra 5 machine indicate that the queries to the ....

[Article contains additional citation context not shown here]

G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: Architecture and implementation. In The Journal of Supercomputing, volume 7, pages 143--180, 1993.
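
Editorial sketch (not taken from the citing paper): the excerpts above mention contention queries of the form "can this operation be scheduled in this cycle?". One common way to answer such a query in modulo scheduling is a modulo reservation table, in which each resource has one slot per cycle of the initiation interval. The Python sketch below assumes that representation. The class name, the `(resource, offset)` usage format, and the example resources are hypothetical and do not reproduce the machine descriptions studied in the cited work.

```python
# Hypothetical modulo reservation table; resource names are illustrative.

class ModuloReservationTable:
    def __init__(self, ii, resources):
        self.ii = ii  # initiation interval, in cycles
        self.busy = {r: [False] * ii for r in resources}

    def _slots(self, usage, cycle):
        # usage: list of (resource, cycle offset) pairs needed by an operation.
        return [(res, (cycle + off) % self.ii) for res, off in usage]

    def can_schedule(self, usage, cycle):
        slots = self._slots(usage, cycle)
        # An operation must not collide with itself modulo II ...
        if len(set(slots)) != len(slots):
            return False
        # ... nor with any slot that is already reserved.
        return all(not self.busy[res][slot] for res, slot in slots)

    def reserve(self, usage, cycle):
        if not self.can_schedule(usage, cycle):
            raise ValueError("resource contention")
        for res, slot in self._slots(usage, cycle):
            self.busy[res][slot] = True


mrt = ModuloReservationTable(ii=2, resources=["mem", "fadd"])
load = [("mem", 0)]                # uses the memory port in its issue cycle
fadd = [("fadd", 0), ("fadd", 1)]  # occupies the adder for two cycles

mrt.reserve(load, 0)
load_at_2 = mrt.can_schedule(load, 2)  # cycle 2 maps to slot 0, already taken
load_at_1 = mrt.can_schedule(load, 1)  # slot 1 is still free
mrt.reserve(fadd, 0)
fadd_at_1 = mrt.can_schedule(fadd, 1)  # adder is busy in both slots
```

Each query touches one table slot per resource usage, which is why the size of the machine description directly affects query time.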


Reducing The Impact Of Register Pressure On Software Pipelined Loops - Llosa (1996)   (8 citations)  (Correct)

....functional unit on which each operation will execute. The earliest VLIW processors built were the attached array processors [Cha81]. The next generation of products were the minisupercomputers: Multiflow's Trace series of machines [CNO 88, CHJ 90] and Cydrome's Cydra 5 [RYYT89, DHB89, BYA93]. During the last few years some microprocessors have been built with a VLIW architecture [Int89, PSW91]. Fundamentals of Instruction Level Parallel Processing. 1.4.4 Hybrid Architectures. There are architectures that combine some of the characteristics of the previous architecture models and ....

....to ensure that no lifetime is longer than the length of the replicated kernel. This is known as modulo variable expansion [Lam87, Lam88]. The other solution is to perform hardware renaming by means of rotating register files [DHB89, BYA93]. The register allocator works with vector lifetimes, that is, the entire sequence of (scalar) lifetimes defined by a particular operation over the whole execution of the loop. Several heuristics for allocating vector lifetimes with and without rotating register files have been proposed and ....

G.R. Beck, D.W.L. Yen, and T.L. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7(1/2):143--180, May 1993.


Efficient State-Diagram Construction Methods for Software.. - Zhang, al. (1999)   (Correct)

....paths as well. However, we find the RP method to work well in practice. We compare the efficiency of the proposed methods with that of the reduced state diagram construction (RD) method in modeling two real processors, namely the DEC Alpha 21064 processor and the Cydra VLIW processor [3]. Our experimental results reveal that the proposed methods result in significant reduction in the construction time by about 3 to 4 orders of magnitude. For example, the RP and E MIS method took only 1.2 and 1.5 seconds respectively, to construct the first 100,000 distinct latency sequences for ....

G. Beck, D.W.L. Yen, and T. L. Anderson. The Cydra-5 minisupercomputer: Architecture and implementation. Journal of Supercomputing, 7, May 1993.


Resource Usage Models for Instruction Scheduling: Two New.. - Ramanan, Govindarajan (1999)   (Correct)

.... table (CRT) [4, 7, 14], (ii) reduced reservation table (RRT) [5], (iii) finite state automaton (FSA) [2], and the two new models, namely (iv) group automaton (GA) and (v) dynamic collision matrix (DCM). The two target architectures used in our study are MIPS R8000 [11] and the Cydra VLIW processor [3]. Our quantitative study consists of using these five resource usage models for the two target architectures in a simple instruction scheduling method and evaluating the space-time tradeoffs. The instruction scheduling method using the different resource models is tested on a set of 927 basic blocks ....

G. Beck, D.W.L. Yen, and T. L. Anderson. The Cydra5 minisupercomputer: Architecture and implementation. Journal of Supercomputing, 7, May 1993.


Instruction Fetch Mechanisms for VLIW.. - Conte, Banerjia.. (1996)   (6 citations)  (Correct)

....reside in the same frame to ease the requirements on the i-fetch mechanism [3]. This requires NOPs, thereby violating RSI. The Cydrome Cydra 5 VLIW machine used a split encoding such that instruction cache blocks were composed of either one MultiOp or multiple one-Op MultiOps called UniOps [20] [5]. Cache blocks composed of one MultiOp are in an uncompressed form, and those composed of UniOps are padded with NOPs, if needed for cache block alignment. It is also non RSI. Another commercial VLIW architecture, the Multiflow TRACE family of machines, used a compressed encoding [17]. Nops were ....

G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 minisupercomputer: architecture and implementation. J. Supercomputing, 7(1):143--180, Jan. 1993.


Systematic Compilation For Predicated Execution - August (2000)   (Correct)

....to IA64 [22] and techniques presented for predicate optimization are applicable with some modification [23]. 2.3.3 Predicated execution support in the Cydra 5. The Cydra 5 system is a very long instruction word (VLIW) multiprocessor system utilizing a directed dataflow architecture [9] [24]. Each Cydra 5 instruction word contains seven operations, each of which may be individually predicated. An additional source operand added to each operation specifies a predicate located within the predicate register file. The predicate register file is an array of 128 Boolean (one bit) ....

G. R. Beck, D. W. Yen, and T. L. Anderson, "The Cydra 5 minisupercomputer: Architecture and implementation," The Journal of Supercomputing, vol. 7, pp. 143-180, January 1993.


Summary of the Scientific Work - Mueller (1999)   (Correct)

....or read simultaneously reside in distinct scratch pads. The compiler has to find an instruction schedule which allows a viable scratch pad allocation. In general, the schedule and the allocation have to be adapted iteratively. The scratch pad design implemented in the Cydra 5 minisupercomputer [3, 6] uses a fixed allocation scheme. The results of a certain function unit are always mapped to the same scratch pads. Conflict free accesses are ensured by duplicating the values several times. That results in a simple but very storage intensive register file design. The paper [C10] presents a ....

G.R. Beck, D.W.L. Yen, and T.L. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7(1/2):143--180, 1993.


VLIW Processors: Efficiently Exploiting Instruction Level.. - Rudd (1999)   (Correct)

....15; Save data registers: 454; Save special system registers: 271 / 336; Miscellaneous overhead: 250; Total exception save: 1063 / 1160; Total exception restore: 900; Total exception overhead: 1963 / 2060. Table 2.1: Exception handler overhead for the Cydra 5 VLIW processor, adapted from Beck et al. [5]. Although the execution behavior required to manage an interruption is reminiscent of a subroutine call, the effect of an interruption is decidedly different and much more costly because of its unscheduled nature. In VLIW processors where the compiler is responsible for schedule correctness, a ....

....for a variable latency memory system. This processor allowed the compiler (or user) to specify the assumed memory latency in the Memory Latency Register and thus to adjust processor behavior to correspond to this memory latency. Special purpose hardware, the Memory Collating Buffer ([42] and [5]), ensured that the memory system appeared to have the specified latency. When a reference arrived early (and possibly out of order) the Memory Collating Buffer collected the result maintaining the proper ordering between references; when a reference arrived late then the processor stalled until ....

[Article contains additional citation context not shown here]

Gary R. Beck, David W. L. Yen, and Thomas L. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7(1/2):143--180, May 1993.


Efficient State-Diagram Construction Methods for.. - Zhang, Govindarajan.. (1999)   (Correct)

....shown to completely eliminate redundancy in the state diagram. From our experimental results on two real architectures, both of the two methods show a great reduction in state diagram construction time. 1 Introduction. Recent studies on modulo scheduling, an instruction scheduling method for loops [9, 10, 15, 16, 3], in a production compiler have reported significant improvement (up to 35%) in the overall runtime for a suite of SPEC floating point benchmark programs [17]. On the other hand, rapid advances in VLSI technology and computer architecture present an important challenge for compiler designers: a ....

....paths as well. However, we find the RP method to work well in practice. We compare the efficiency of the proposed methods with that of the reduced state diagram construction (RD) method in modeling two real processors, namely the DEC Alpha 21064 processor and the Cydra VLIW processor [3]. Our experimental results reveal that the proposed methods result in significant reduction in the construction time by about 3 to 4 orders of magnitude which means we can provide a reasonable amount of distinct paths within minutes instead of days. Another interesting observation made from our ....

G. Beck, D.W.L. Yen, and T. L. Anderson. The Cydra-5 minisupercomputer: Architecture and implementation. Journal of Supercomputing, 7, May 1993.


Conflict-Free Access to Multiple Single-Ported Register Files - Müller, Vishkin (1997)   (Correct)

....pads. In general, the schedule and the scratch pad allocation have to be adapted iteratively. [Footnotes: Supported by DFG. Also affiliated with Tel Aviv University. Partially supported by NSF grant CCR 9416890.] The scratch pad design implemented in the processors of the Cydra 5 mini supercomputer [1, 4] uses a fixed allocation scheme (Section 2). The results of a certain function unit are always mapped to the same scratch pads. Conflict free accesses are ensured by storing the values several times. That results in a simple but very storage intensive register file design. Our approach (Section 3) ....

....complicated again. However, it can be computed efficiently by an edge coloring algorithm. Section 4 compares the performance and the hardware complexity of the two designs, based on the formal hardware model of [9]. 2. The Context Register Matrix CRM. The processors of the Cydra 5 supercomputer [1] implement a VLIW architecture which requires a register file with 12 read and 6 write ports. Instead of a multiport RAM, the processors use a scratch pad design to provide sufficient read/write bandwidth. The allocation scheme of the scratch pads is fixed. That makes it much easier to find a ....

G. Beck, D. Yen, and T. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7(1/2):143--180, 1993.


RaPiD - A Configurable Computing Architecture for.. - Ebeling.. (1996)   (3 citations)  (Correct)

....can achieve this at a 100MHz clock rate. We make use of two different and seemingly competing techniques. First, we use memories that support fast burst mode, that is, high bandwidth data transfer to addresses in the same row. Second, we organize memory into randomly interleaved memory banks [10, 3]. Fast burst mode supports mostly sequential memory accesses. Data can usually be stored in memory so that accesses are mostly sequential. For example, all the address streams for matrix multiply are in row major order and therefore mostly sequential. Some applications, however, access data ....

G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 minisupercomputer: architecture and implementation. Journal of Supercomputing, 7(1-2):143--80, 1993.


Hamiltonian Recurrence for ILP scheduling - Barrado, Labarta (1997)   (Correct)

....[Hsu86], perfect pipelining [SuWa91, AiNN95], variants of percolation scheduling [WBHS92, EbNi89, TiLS90, MoEb92], linear programming [GoAG94, Feau94]. Also hardware developers offer more facilities for efficient software pipelining: rotating registers, predicated execution, speculative execution [Ebci87, MLCH92, BeYA93, Colw90, RaST92]. Loops are transformed to a parallel form where operations are compacted in a loop kernel, where different iterations overlap their execution. Optimal solutions are those that map all the operations maximizing the operations per cycle rate. The number of cycles of the kernel is known as Initiation ....

Beck G., Yen D. and Anderson T. "The Cydra 5 minisupercomputer: Architecture and implementation." The J. Supercomputing, 7, pp. 143-180. 1993.


Lifetime-Sensitive Modulo Scheduling - Huff (1993)   (113 citations)  (Correct)

....the slack-scheduling framework. The scheduler's execution time is analyzed in Section 6. Performance measurements are shown in Section 7. Finally, Section 8 offers some comparisons with related work. 2 Target Machine. The target machine is a hypothetical VLIW processor similar to Cydrome's Cydra 5 [20, 2], including architectural support for overlapping loops without using code duplication [5]. Nevertheless, the scheduling techniques shown in this paper can be directly applied to conventional RISC machines [14, 23], albeit at the expense of code expansion [19]. 2.1 Functional Units ....

G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra-5 mini-supercomputer: Architecture and implementation. Journal of Supercomputing, 7(1/2), Jan. 1993.


Register Allocation for Predicated Code - Eichenberger, Davidson (1995)   (8 citations)  (Correct)

....with the lowest achievable register requirements for the given machine, loop, and Modulo Reservation Table (MRT) [1]. In this paper, we use the MRT produced by the Iterative Modulo scheduler as input to this scheduler. The machine model used in these experiments corresponds to the Cydra 5 machine [26]. This choice was motivated by the availability of quality code for this [figure residue: axes "Registers" and "Fraction of loops scheduled"; legend text: Schedule Independent Lower Bound Best Post Scheduling Bundling Heuristic MinReg Stage Scheduler Iterative Modulo] ....

G. R. Beck et al. The Cydra 5 mini-supercomputer: Architecture and implementation. In The J. of Supercomputing, volume 7, pages 143--180, 1993.


A Reduced Multipipeline Machine Description that.. - Eichenberger, Davidson (1996)   (12 citations)  (Correct)

....requirements can be expressed in terms close to the actual hardware structure of the target machine and the reduced machine description used by the compiler is generated in an error free and automated fashion. Experiments with the DEC Alpha 21064 [19], the MIPS R3000/R3010 [20], and the Cydra 5 [21] machines indicate 4 to 7 times faster contention queries and require 22% to 90% of the memory storage used by the original machine descriptions. These improvements are obtained by using highly reduced machine descriptions instead of the original or manually optimized machine descriptions. Using ....

....of bitvectors that need to be tested to answer a query, i.e. the number of nonempty groups of k consecutive cycles. This number, referred to as word usages, is averaged over all operation types and possible alignments. As a proof of concept, we investigated our technique on the Cydra 5 machine [21], which has the most complex resource requirements of the three machines. The machine configuration investigated has 7 functional units: 2 memory port units, 2 address generation units, 1 FP adder unit, 1 FP multiplier unit, and 1 branch unit. The original machine description used by the Cydra 5 ....

G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: Architecture and implementation. In The Journal of Supercomputing, volume 7, pages 143--180, 1993.


In Proc. 11th International Parallel Processing.. - Silvia Mueller..   (Correct)

No context found.

G. Beck, D. Yen, and T. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. The Journal of Supercomputing, 7(1/2):143--180, 1993.


Reduced Code Size Modulo Scheduling in the Absence of.. - Llosa, Freudenberger (2002)   (1 citation)  (Correct)

No context found.

G.R. Beck, D.W.L. Yen, and T.L. Anderson. The Cydra 5 minisupercomputer: Architecture and implementation. J. Supercomputing 7, 1/2 (May 1993), pp. 143-180.


Code Size Minimization and Retargetable - Assembly For Custom (2000)   (Correct)

No context found.

G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: architecture and implementation. The Journal of Supercomputing, 7(1/2):143--180, 1993.
