Digital Equipment Corporation, Maynard, MA. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1st edition, October 1992. Order number EC-N0079-72.
....in perspective, we show the memory hierarchy figures of the DEC 4000 AXP series of workstations. Table 1.1 lists the size of each level of the memory hierarchy, measured in kilobytes (KB), and the corresponding latency, measured in nanoseconds (ns). Data used in the table is based on that in [14, 27] (160 megahertz (MHz) DECchip 21064 CPU, maximum memory cache sizes, bus latency included). Note that each level in the memory hierarchy is approximately two orders of magnitude larger than the immediately higher level and one order of magnitude slower. In some systems, the memory hierarchy is ....
....[Figure 3.6: Memory network outgoing message path] The logical hardware structure of all three versions of the simulator is shown in Figure 3.7. The CPU, cache, write buffer, and bus are based specifically on the DECchip 21064-AA implementation [14] of the Alpha architecture [13]. In addition to the modules shown in Figure 3.7, we simulate a network fringe (Figure 3.8). The network fringe embodies the actions of the remainder of the multicomputer (everything external to the node). Specifically, it sends messages to the network interface at a ....
Digital Equipment Corporation, Maynard, MA. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1st edition, October 1992. Order number EC-N0079-72.
....reported in Chapters 5, 6, and 8, we tried two or more input sets for most benchmark programs and found the results to be stable across the runs. 4.2 Measurement Technique The run time measurements in this thesis have been done using Zippy [46], a cycle-level simulator for the DEC Alpha 21064 [35]. We had originally used direct measurements rather than simulations but found that the variation between the runs due to context switches and spurious conflicts in the small direct-mapped caches obscured the performance impact of our optimizations. Zippy is a highly detailed simulator for the ....
Digital Equipment Corporation, Maynard, MA. DECchip 21064-AA Microprocessor Hardware Reference Manual, first edition, Oct. 1992.
....a PALcode trap is only two pipeline drains) and simplifies the implementation since PALcode executes with all internal chip registers available and all interrupts masked. Interrupt dispatching is also performed in PAL mode. The scheduler is implemented as an Alpha AXP software interrupt handler [DEC92], and so executes in the protection domain of the currently running domain. The scheduler is always the last pending interrupt to be serviced, and executes with all interrupts masked. 4.4 The Nemesis Scheduler The operation of the scheduler can now be described. 4.4.1 Scheduler Domain States ....
Digital Equipment Corporation. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1st edition, October 1992. Order Number EC-N0079-72. (p 54)
....are fetched and source operands are read, they are dispatched through a sparse crossbar to the reservation stations in the execution core. In order to reduce the complexity of the dispatch crossbar, each dispatch slot is reserved for particular instruction types, similar to the Alpha 21064 [6]. All instruction alignment to dispatch slots is performed by the optimizing back end. 4.2 Execution Core The execution core of both the baseline machine model and the Turboscalar microarchitecture are idealized, yielding higher achievable performance. Execution units are clustered by function ....
Digital Equipment Corporation, DECchip 21064-AA Microprocessor Hardware Reference Manual, 1992.
.... see the dynamic behaviors of the machine while simulation is taking place. These two design objectives led to capabilities that we called simulator compilation and trace animation. Simulators for a number of real machines were implemented using VMW. These machines include the Alpha AXP 21064 [Dig92] and 21164 [BK95], the RS/6000 [Gro90], the PowerPC 601 [IBM93] and 620 [DNS95], and a multithreaded version of the PowerPC 620 developed at CMU. Once the architecture description files are in place, for each new microarchitecture, approximately 1-3 months are needed to write and debug the machine ....
Digital Equipment Corporation. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1992.
....performance of heap allocation on newer machines. As CPUs get faster relative to main memory, memory subsystem performance becomes even more crucial to good performance. To address the increasing discrepancy between CPU speeds and main memory speeds, newer machines, such as Alpha workstations [20], often have features such as secondary caches, stream buffers, and register scoreboarding. Secondary caches improve performance by reducing accesses to main memory. Stream buffers and scoreboarding improve performance by reducing the latency of cache misses. The impact of these features on ....
Digital Equipment Corporation. DECchip 21064-AA Microprocessor Hardware Reference Manual, first ed. Maynard, MA, Oct. 1992.
.... tries to minimize how often the Write Buffer becomes full, so as not to stall the pipeline due to writes waiting for an available buffer slot [16]. In the 21064 design, the Write Buffer becomes stale when it contains one entry and 256 cycles have passed since the last Write Buffer operation [9]. If this occurs, the control logic tries to empty the head buffer entry. With a larger L3 cache incurring fewer off-chip misses, the injection rates increase, resulting in fewer occurrences of the Write Buffer becoming stale. If, for a particular L2 cache size and application, this results in a ....
Digital Equipment Corporation, DECchip 21064-AA Microprocessor Hardware Reference Manual, 1992.
....the best performance occurs when the allocation space is 512K bytes or less, which suggests that there is a penalty being paid for allocation misses. It appears that there are frequent flushes of the write buffer before a full block is allocated. There are several reasons to flush the write buffer [9], but we identify the following as the most likely cause of the problem. The first level cache policy is read allocate, so when there is a read miss, the block containing the address must be fetched. If the block is in the write buffer, but is partially filled, the write buffer must be flushed, ....
Digital Equipment Corporation, Maynard, Massachusetts. DECchip 21064-AA Microprocessor Hardware Reference Manual, first edition, October 1992. Order number EC-N0079-72.
....to resources not normally visible in kernel mode. Table 1: 21064 Performance Monitor Event Sources The 21064 performance monitor essentially consists of a counter that, every 4096 ticks, resets itself and generates an interrupt. The counter can be multiplexed among 17 different event sources [Dig92]. The event sources are described in Table 1. Figure 3 schematically shows the relationship between event sources relating to CPU instruction issue performance. Suppose we observe the CPU for 1000 cycles. During some of these cycles the CPU is able to issue two instructions simultaneously, during ....
Digital Equipment Corporation. DECchip 21064-AA Microprocessor Hardware Reference Manual. Digital Press, Maynard, Massachusetts, first edition, October 1992. Order number EC-N0079-72.
....times, so we report only one number. DI: Disabling all interrupts before entering an atomic sequence and enabling all interrupts after leaving it; i.e., this scheme does not admit interrupt priority levels. The Alpha architecture does not support this. However, the implementation described in [Dig92] provides the required facilities in a chip-specific fashion. These low-level facilities normally are not accessible to kernel-level software. The measurements therefore had to be performed in a small stand-alone system. For the PA-RISC, the technique was measured in a Mach kernel that was ....
Digital Equipment Corporation. DECchip 21064-AA Microprocessor Hardware Reference Manual. Digital Press, Maynard, Massachusetts, first edition, October 1992. Order number EC-N0079-72.
....are written to the same entry in the write buffer. This is a feature known as write merging [24]. Second, since main memory access time is roughly 145 nanoseconds, dividing this by 35 nanoseconds gives an estimated write buffer size of 4. This is corroborated by the Alpha 21064 Reference Manual [7]. This section has provided a detailed examination of the local memory system. We determined that the cost of an off-chip memory access is 23 cycles, that the large page size essentially eliminates TLB costs, and that the write buffer contains four entries and supports write merging. Later ....
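The write buffer estimate in the excerpt above is a simple ratio, and a minimal Python sketch makes the arithmetic explicit. The excerpt is truncated before it explains what the 35 ns figure measures, so the parameter name `interval_ns` and the rounding to the nearest integer are assumptions made for illustration, not the citing paper's stated method.

```python
def estimate_buffer_entries(memory_access_ns: float, interval_ns: float) -> int:
    """Estimate write buffer depth as the ratio of two timings.

    memory_access_ns: main memory access time (145 ns in the excerpt).
    interval_ns: the 35 ns figure from the excerpt; its exact meaning is
        not visible in the truncated text, so the name is hypothetical.
    The ratio is rounded to the nearest whole entry (an assumption).
    """
    return round(memory_access_ns / interval_ns)


ratio = 145 / 35                      # about 4.14
entries = estimate_buffer_entries(145, 35)
print(f"raw ratio {ratio:.2f}, estimated entries {entries}")
```

The raw ratio is slightly above 4, which is consistent with the excerpt's estimate of a four-entry write buffer.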
Digital Equipment Corporation. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1992.
....words (corresponding to strided or scatter-gather references in a vector machine) inherently inefficient. More importantly, they typically allow only one or a small number of outstanding references to memory, limiting the ability to pipeline requests in large systems. For example, the DEC 21064 [11] and 21164 [12], on which the Cray T3D and T3E are based, allow a maximum of one and two outstanding cache line fills from memory, respectively. Microprocessors often lack sufficiently large physical address spaces for use in large-scale machines. The DEC 21064, for example, implements a 33-bit ....
Digital Equipment Corporation, DECchip 21064-AA Microprocessor Hardware Reference Manual, 1992.
....a higher bandwidth network. Figure 1 summarizes these differences. As we will show, these advantages do not always translate to better language-level performance. First, we discuss the two machines in more detail. The Cray T3D [4, 10, 11] is built around the 150 MHz dual-issue Alpha 21064 processor [8, 16] and is scalable up to 2048 processors. It has an 8 KB L1 cache, but, unlike workstations built with the 21064, no off-chip L2 cache. The L2 cache is omitted to allow for lower memory latency on a cache miss. The global address space is implemented using a combination of hardware augmentations, ....
Digital Equipment Corporation. DECchip 21064-AA Microprocessor Hardware Reference Manual, 1992.
....64K byte I 64K D 4 d m yes no WV Fetches 8 blocks (32 bytes) on a read miss. 13] Alpha 21064 1992 8K I 8K D 32 d m no no# On chip first level cache, with on chip control for second level cache. Load instruction (for explicit cache line allocation by prefetch) semi stalls. #Write around. [12] Alpha 21164 1994 8K I (L1) 8K D (L1) 96K (L2) 32 32 32 64 d m d m 3 way no no# fetch On chip control for a third level direct mapped cache. Load instruction, usable for explicit cache line allocation by prefetch, is nonblocking. DEC3000 500 1992 8K I(L1) 32 d m Uses an Alpha 21064 CPU. 15] 8K ....
Digital Equipment Corporation, Maynard, Massachusetts. DECchip 21064-AA Microprocessor Hardware Reference Manual, first edition, October 1992. Order number EC-N0079-72.
....implementations. For example, the PowerPC 603 requires between 2 and 6 cycles [3], a 1:3 ratio, for multiply depending on the operands, while division is a fixed 37 cycles. In contrast, the Alpha 21064 uses a software division algorithm with a best case of 16 cycles and a worst case of 144 cycles [4]. This is a 1:9 ratio and the largest ALU instruction best-case to worst-case ratio we know of in current processors. For floating point operations the PowerPC 603 has a single cycle throughput with a fixed three cycle latency for all operations except for division. Division is not pipelined and ....
....for division. Division is not pipelined and requires a fixed 18 cycles for single precision and 33 for double. Single cycle latency with multiple cycle throughput on floating point multiplication is standard in the popular processors as is a longer, but fixed delay for division (DEC Alpha 21064 [4], Intel Pentium [1], MIPS R4400 [18]). The conclusions we can draw on variability due to instruction data dependencies is that most instructions add no variability. Shift, multiplication, and division instructions can contribute variance in some processors, however, if one or more operands are ....
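The best-case to worst-case ratios quoted in the first excerpt above (2 to 6 cycles for the PowerPC 603 multiply, 16 to 144 cycles for the Alpha 21064 software divide) can be checked with a short Python sketch. The helper name `best_to_worst_ratio` is hypothetical and is introduced here only for illustration.

```python
from fractions import Fraction


def best_to_worst_ratio(best_cycles: int, worst_cycles: int) -> str:
    """Reduce a best-case/worst-case cycle count pair to lowest terms."""
    reduced = Fraction(best_cycles, worst_cycles)
    return f"{reduced.numerator}:{reduced.denominator}"


# Cycle counts as quoted in the excerpt.
ppc603_multiply = best_to_worst_ratio(2, 6)
alpha21064_divide = best_to_worst_ratio(16, 144)
print("PowerPC 603 multiply:", ppc603_multiply)
print("Alpha 21064 software divide:", alpha21064_divide)
```

Both results agree with the 1:3 and 1:9 ratios stated in the excerpt.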
Digital Equipment Corporation, editor. DECchip 21064-AA Microprocessor Hardware Reference Manual. Digital Equipment Corporation, 1992.
....We chose the Digital Alpha [7] as our architecture for the instruction scheduling problem. When introduced it was the fastest scalar processor available, and from an instruction dependence analysis standpoint its instruction set is simple. The 21064 implementation of the instruction set [4] is interestingly complex, having two dissimilar pipelines and the ability to issue two instructions per cycle (also called dual issue) if a complicated collection of conditions hold. Instructions take from one to many tens of cycles to execute. We used the SPEC95 benchmark suite, which consists ....
Digital Equipment Corporation, Maynard, MA. DECchip 21064-AA Microprocessor Hardware Reference Manual, first edition, October 1992.