| Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, 1995. ISBN 1-55558-145-5. |
....language from the ML [5] family of languages. It is a strongly typed, garbage-collected, functional language. The compilers provided can produce byte code for a wide variety of Unix variants as well as for Microsoft Windows 95. Additionally, native-code compilers are provided for the Digital Alpha [6] and Intel x86 [7], amongst others. Both types of compilers use static type checking. Caml also has a dynamic loader, which allows byte code to be loaded into a running byte-code program. Since the byte code is machine independent, our active programs are composed of byte-code files, which may be ....
Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, 2nd edition, 1995.
....degrade, or have no noticeable effect on the performance of a design. uses simple mechanisms to reduce the amount of work added to speculative execution for the purpose of supporting software copy-on-write. As background, on the architecture targeted by my implementation (the Alpha architecture [50]), a store instruction designates the value to store and the address at which to store it, where the address is the sum of a register value and a constant offset. A load instruction designates the address from which to load a value, where the address is specified as for a store instruction, and ....
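The address rule described in this passage can be made concrete. The following is a minimal C sketch, assuming (from our reading of the Alpha manual, not from the citing text) that the constant offset is a 16-bit displacement that is sign-extended before being added to the 64-bit base register; the function name `alpha_effective_address` is hypothetical.

```c
#include <stdint.h>

/* Sketch: effective address of an Alpha load or store.
 * Assumption: the constant offset is a 16-bit displacement, sign-extended
 * to 64 bits and added to the base register; the sum wraps modulo 2^64. */
static uint64_t alpha_effective_address(uint64_t base_reg, uint16_t disp16)
{
    /* Sign-extend without relying on implementation-defined casts. */
    int64_t disp = (disp16 & 0x8000) ? (int64_t)disp16 - 0x10000 : (int64_t)disp16;
    return base_reg + (uint64_t)disp;  /* unsigned addition wraps modulo 2^64 */
}
```

For example, a displacement field of 0xFFF8 means -8, so a base register holding 0x1000 yields the address 0xFF8.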
Richard L. Sites and Richard T. Witek. Alpha AXP architecture reference manual, second edition. Digital Press, Boston, MA, 1995.
....There are several relaxed memory consistency models, depending on the degree to which they relax the order of read and write accesses: processor consistency [40], weak ordering [3, 27], and release consistency [39]. There also are consistency models specific to processor architectures, such as DEC Alpha [84], IBM PowerPC [22] and SUN SPARC [92]. Commercial systems that have relaxed memory consistency models include AlphaServer 8200/8400, Cray T3D/T3E, SparcCenter 1000/2000, Ultra Enterprise Servers, and Convex SPP systems [2]. However, the relaxed memory consistency model makes programming and ....
....optimizations. The commercial systems that have relaxed memory consistency models include AlphaServer 8200/8400, Cray T3D/T3E, SparcCenter 1000/2000, Ultra Enterprise Servers, and Convex SPP systems [2]. There also are consistency models specific to processor architectures, such as DEC Alpha [84], IBM PowerPC [22] and SUN SPARC [92]. Although the relaxed memory consistency model delivers better performance because it makes a wide range of hardware-level optimizations possible, it increases the complexity of writing and porting shared-memory programs because the programmer has to ....
Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, second edition, 1995.
....different load conditions. The time metric also provides little insight into why the program required a particular amount of time to complete. To provide a better understanding of why a computer system has a particular performance, a number of processors, such as the Intel Pentium [6], Compaq Alpha [8], and MIPS R10000 [11], have included hardware performance counters that can count either the occurrences of specific events or the duration of particular conditions. These counters allow the efficient collection of metrics such as the number of instructions executed and the number of memory ....
....at their normal rate. However, binary rewriters still suffer from poor performance when the program is heavily instrumented, e.g., when collecting data on very frequent events such as memory references. Newer processors have reduced the overhead of collecting common performance metrics. The Compaq Alpha [8], Intel Pentium [6] and SGI MIPS R10000 [11] all have performance-monitoring registers in the processor core. These registers allow the programmer to choose which metrics they collect. Since no modifications are made to the program, the overhead of data collection is significantly ....
R. L. Sites and R. T. Witek. Alpha AXP Architecture Reference Manual, Digital Press, Newton, Massachusetts, second edition, 1995.
....systems. The collected values avoid all time-based metrics. Thus, assuming the program has deterministic execution, the results of the simulator driven by the graph will accurately predict the behavior of the proposed computer system. Processors such as the Intel Pentium [Intel97], Compaq Alpha [SiWi95], IBM PowerPC [WeChSh94] and SGI MIPS R10000 [ZaLaTu96] include special performance-monitoring registers to count events such as the number of instructions, data memory references, and floating-point operations. The use of these registers allows fine-grained data contained within a node to be ....
R. L. Sites and R. T. Witek. Alpha AXP Architecture Reference Manual, Digital Press, Newton, Massachusetts, second edition, 1995.
....with information about the positioning of its ibloads. For example, in the Alpha ISA an indirect branch has the format displayed in Fig. 5.6. [Figure 5.6: Indirect branch format in the Alpha ISA. Fields: Opcode (6 bits), RA (5 bits), RB (5 bits), Displacement (16 bits).] According to the Alpha Reference Manual [147], the PC of the instruction following the indirect branch is written to register RA, and then the PC is loaded with the target virtual address (next PC). The target address is supplied by register RB. Since the 14-bit displacement field is not used during the execution of the indirect branch, it can ....
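A minimal C sketch of decoding the indirect-branch (jump) format from Figure 5.6 follows. It assumes, from our reading of the Alpha manual rather than from the citing text, that jumps use opcode 0x1A and that the 16-bit displacement field splits into a 2-bit jump-kind field (bits 15:14: JMP, JSR, RET, JSR_COROUTINE) and the 14-bit hint that the passage describes as unused during execution. All type and function names are hypothetical.

```c
#include <stdint.h>

/* Jump kinds selected by instruction bits <15:14> (assumed encoding). */
typedef enum { ALPHA_JMP = 0, ALPHA_JSR = 1, ALPHA_RET = 2, ALPHA_JSR_COROUTINE = 3 } alpha_jump_kind;

typedef struct {
    unsigned opcode;       /* bits <31:26>; 0x1A for jumps */
    unsigned ra;           /* bits <25:21>; receives the updated PC */
    unsigned rb;           /* bits <20:16>; holds the target address */
    alpha_jump_kind kind;  /* bits <15:14> */
    unsigned hint;         /* bits <13:0>; not used to compute the target */
} alpha_jump;

static alpha_jump alpha_decode_jump(uint32_t insn)
{
    alpha_jump j;
    j.opcode = (insn >> 26) & 0x3F;
    j.ra     = (insn >> 21) & 0x1F;
    j.rb     = (insn >> 16) & 0x1F;
    j.kind   = (alpha_jump_kind)((insn >> 14) & 0x3);
    j.hint   = insn & 0x3FFF;
    return j;
}

/* Replace only the 14-bit hint, leaving opcode, registers and kind intact. */
static uint32_t alpha_set_jump_hint(uint32_t insn, unsigned hint)
{
    return (insn & ~(uint32_t)0x3FFF) | (hint & 0x3FFF);
}
```

For instance, the common return instruction `ret zero,(ra),1` encodes as 0x6BFA8001: opcode 0x1A, RA = 31, RB = 26, kind RET, hint 1. Rewriting only the low 14 bits, as `alpha_set_jump_hint` does, leaves the target computation untouched, which is what makes the field available to carry other information.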
R. Sites and R. Witek. Alpha AXP Architecture Reference Manual. Digital Press, 1995.
....semantics can be modeled by requiring that memory instructions be executed in order but requiring that stores occur non-atomically. Compiler-oriented memory models such as Location Consistency and DAG Consistency use this approach, as do many processor architecture specifications such as Alpha [17] and PowerPC [12]. The prescient stores in the original Java Memory Model effectively use this strategy, allowing a store action to be reordered with respect to a store instruction. We feel that atomic stores are easy to understand and that operation reordering is natural to programmers. Some of ....
R. L. Sites and R. T. Witek, editors. Alpha AXP Architecture Reference Manual (Second Edition). Butterworth-Heinemann, 1995.
....explicitly or not (e.g., programmer-centric models [16, 5, 19]). It is the task of the compiler to ensure that the semantics of a high-level program are preserved when its compiled version is executed on an architecture with a certain low-level memory model (e.g., architecture-centric models [25, 18, 26, 14]). The essence of any memory model is the correspondence between each load instruction and the store instruction that supplies the value retrieved by the load. Unfortunately, at the architecture level, memory access operations often have some sophisticated implementation characteristics that make it ....
R. L. Sites and R. T. Witek, editors. Alpha AXP Architecture Reference Manual (Second Edition). Butterworth-Heinemann, 1995.
....a minimal porting effort for the OS (Tru64 Unix). The remaining sections provide more detail about the various modules in the Piranha architecture. 2.1 Alpha CPU Core and First-Level Caches. The processor core uses a single-issue, in-order design capable of executing the Alpha instruction set [39]. It consists of a 500 MHz pipelined datapath with hardware support for floating-point operations. The pipeline has 8 stages: instruction fetch, register read, ALU 1 through 5, and write back. The 5-stage ALU supports pipelined floating-point and multiply instructions. However, most instructions ....
R. L. Sites and R. T. Witek. Alpha AXP Architecture Reference Manual (second edition). Digital Press, 1995.
....the overhead of such synchronization is prohibitive due to the high frequency of the inline checks that must be protected. SMP nodes that support a relaxed memory model further increase synchronization costs, due to the need for expensive fence instructions (called memory barriers on the Alpha [16]) at synchronization points that enforce ordering of memory operations. These fences take a minimum of ten processor cycles for the Alpha multiprocessors used in our experiments. 3.3 Our General Solution to Race Conditions The prohibitive cost of synchronization and fence instructions heavily ....
R. L. Sites and R. T. Witek, editors. Alpha AXP Architecture Reference Manual. Digital Press, 1995. Second Edition.
....time resolution. For the multithreaded operations being investigated, no meaningful data could be collected with the standard software timing subroutines. There are means of obtaining higher-resolution timing measurements. Many processors, such as the Intel Pentium [Intel 97] and the DEC Alpha [Sites 95], have user-visible registers that count processor clock cycles. A small inline assembly-language routine was written to read the value of the processor's high-resolution timer. The overhead of this code was approximately 44 clock cycles, about 150 ns. This provides very precise timing on a single ....
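The passage reads the cycle counter with an inline assembly routine that is not shown. The C sketch below covers only the arithmetic on the raw readings. It assumes, per our reading of the Alpha manual, that the counter returned by the RPCC instruction occupies the low 32 bits of a 64-bit value and wraps, so differences are taken modulo 2^32 and are valid only for intervals shorter than 2^32 cycles. The function names and the `overhead` parameter are illustrative.

```c
#include <stdint.h>

/* Sketch: elapsed cycles between two raw counter readings.
 * Assumption: only the low 32 bits hold the cycle count and they wrap,
 * so the difference is taken modulo 2^32; the upper 32 bits are ignored.
 * Valid only for intervals shorter than 2^32 cycles. */
static uint32_t elapsed_cycles(uint64_t start, uint64_t end, uint32_t overhead)
{
    uint32_t delta = (uint32_t)((uint32_t)end - (uint32_t)start);
    /* Remove the measured cost of the timing code itself, clamping at zero. */
    return delta > overhead ? delta - overhead : 0;
}

/* Convert a cycle count to nanoseconds for a clock rate given in MHz. */
static double cycles_to_ns(uint32_t cycles, double clock_mhz)
{
    return cycles * 1000.0 / clock_mhz;
}
```

With the figures quoted above, 44 cycles in about 150 ns implies a clock of roughly 293 MHz (44 / 150 ns); passing 44 as `overhead` removes the cost of the timing code from each measurement.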
Sites, R. L. and R. T. Witek, Alpha AXP Architecture Reference Manual, Digital Press, Newton, Massachusetts, second edition, 1995.
....for that feature, and concluding with a discussion of alternatives and rationale, often with some examples illustrating intended usage. The examples and discussion of processors assume an IEEE 754-compliant processor with dynamic rounding modes and dynamic trapping status. The Alpha architecture [83] can encode some rounding modes statically in a field of a floating-point instruction; the interaction of this feature with Borneo semantics is noted on a number of occasions throughout the text. 6.1. indigenous Allow me to introduce you to Ceti Alpha V's only remaining indigenous lifeform. ....
....modes are defined in the Math class (Java does not have enumerated types). A rounding declaration is effective from the declaration point in a block to the close of that block or to the next rounding declaration. If the expression given to a rounding declaration evaluates to an .... [Footnote 16: The Alpha [83] can statically encode three of the four rounding modes into two bits of arithmetic opcodes; the fourth bit pattern is used to take the rounding mode from the FPCR (Floating-point Control Register).] [Footnote 17: Borneo semantics preserve the flag effects of expression evaluation. If evaluating an ....]
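Footnote 16 can be illustrated with a small decoder. This C sketch assumes the encoding as we read it in the Alpha manual: the rounding field sits in instruction bits 12:11 with 00 = chopped, 01 = minus infinity, 10 = normal, and 11 = dynamic, and the FPCR dynamic-rounding field sits in bits 59:58 using the same order, with 11 = plus infinity. The bit positions should be checked against the manual; all names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    ROUND_CHOPPED,    /* toward zero */
    ROUND_MINUS_INF,  /* toward minus infinity */
    ROUND_NEAREST,    /* "normal" rounding */
    ROUND_PLUS_INF    /* toward plus infinity */
} rounding_mode;

/* Effective rounding mode of an IEEE operate instruction (assumed encoding). */
static rounding_mode effective_rounding(uint32_t insn, uint64_t fpcr)
{
    static const rounding_mode order[4] = {
        ROUND_CHOPPED, ROUND_MINUS_INF, ROUND_NEAREST, ROUND_PLUS_INF
    };
    unsigned rnd = (insn >> 11) & 0x3;
    if (rnd != 3)                        /* static: 00, 01, 10 */
        return order[rnd];
    return order[(fpcr >> 58) & 0x3];    /* dynamic: consult FPCR<59:58> */
}
```

Under this encoding, round toward plus infinity is reachable only through the dynamic pattern, which is why the footnote says three of the four modes can be encoded statically.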
Richard L. Sites, Richard T. Witek, Alpha AXP Architecture Reference Manual, Second Edition , Digital Press, 1995.
....that section will have little effect on the overall performance of the program. Traditionally, expensive function calls were used to obtain time data from off-chip sources, causing the measurement to perturb the data being collected. Many processors, including the Intel Pentium [5], Compaq Alpha [10], and PowerPC [8], include timestamp registers, which can be read with a couple of simple instructions and provide time resolution on the order of tens of nanoseconds. This processor hardware reduces the cost of measuring the amount of time required to execute a section of code. The abilities of ....
R. L. Sites and R. T. Witek. Alpha AXP Architecture Reference Manual, Digital Press, Newton, Massachusetts, second edition, 1995.
....program code. It uses ATOM to identify procedure boundaries and navigate the text segment. The executable-rewriting functionality of ATOM is not used. 3.1 Classifying relocations. Digital UNIX defines a set of register usage conventions that compilers must obey when compiling programs for the Alpha [8,9]. Below we examine different types of program relocation within the context of the Alpha instruction set and Digital UNIX register usage conventions. Note that the term data object is used to refer to a piece of binary data stored in the program's data segments. PostMorph does not analyze or ....
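One relocation pattern produced under these register conventions is an address formed relative to the global pointer ($29) by an LDAH/LDA instruction pair. The C sketch below is an illustration under stated assumptions, not PostMorph's code: it assumes the standard semantics of LDA (add a sign-extended 16-bit displacement) and LDAH (add a sign-extended 16-bit displacement shifted left by 16), and its type and function names are hypothetical. Because the low half is sign-extended, the high half must be rounded up whenever bit 15 of the offset is set.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int32_t hi, lo; } hi_lo;  /* each must fit in 16 signed bits */

/* Split a GP-relative offset into LDAH (hi) and LDA (lo) displacements.
 * Returns false if no single LDAH/LDA pair can reach the offset. */
static bool split_offset(int64_t offset, hi_lo *out)
{
    int64_t lo = ((offset & 0xFFFF) ^ 0x8000) - 0x8000;  /* sign-extend low 16 bits */
    int64_t hi = (offset - lo) / 65536;                  /* exact: remainder is zero */
    if (hi < -32768 || hi > 32767)
        return false;
    out->hi = (int32_t)hi;
    out->lo = (int32_t)lo;
    return true;
}

/* What the pair computes at run time: gp + hi * 65536 + lo. */
static uint64_t apply_pair(uint64_t gp, hi_lo p)
{
    return gp + (uint64_t)((int64_t)p.hi * 65536 + p.lo);
}
```

For example, an offset of 0x12348000 splits into hi = 0x1235 and lo = -32768, since 0x1235 * 65536 - 32768 = 0x12348000.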
Richard L. Sites and Richard T. Witek, Alpha AXP Architecture Reference Manual (Second Edition), Digital Press, 1995.
....time resolution. For the multithreaded operations being investigated, no meaningful data could be collected with the standard software timing subroutines. There are means of obtaining higher-resolution timing measurements. Many processors, such as the Intel Pentium [Int97] and the DEC Alpha [SW95], have user-visible registers that count processor clock cycles. This suits data collection on single-processor systems, but does not keep the clock-count register values of multiple processors synchronized. Thus, an external time base is required. The National Institute of ....
R. L. Sites and R. T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, Newton, Massachusetts, second edition, 1995.
....optimizations typically allow the spinning to take place in the local cache to reduce bus traffic [3, 14]. More recently, lock-free synchronization has been widely studied [9, 11] and is included in modern instruction sets, e.g., the DEC Alpha's load-locked (ldl_l) and store-conditional (stl_c) [17] (collectively, LL/SC). Rather than achieve mutual exclusion by preventing multiple threads from entering the critical section, lock-free synchronization prevents more than one thread from successfully writing data and exiting the critical section. A store-conditional only completes successfully ....
....and parallel version of our synchronization efficiency test. • SMT LL/SC. This scheme uses the lock-free synchronization currently supported by the Alpha. To implement the ordered access in the benchmark, the acquire primitive is implemented with load-locked and store-conditional, as given in [17], and the release is a store instruction. • SMP. These each use the same primitives as SMT block, but force the synchronization (and data sharing) to occur at different levels in the memory hierarchy (i.e., the L2 cache, the L3 cache, or memory). This mimics the performance of systems with ....
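The acquire/release pair described above can be sketched portably. This C11 sketch is a stand-in under stated assumptions, not the benchmark's code: `atomic_compare_exchange_weak_explicit` plays the role of an ldl_l/stl_c sequence (a weak compare-exchange, like stl_c, may fail spuriously and therefore sits in a retry loop), and the release is a single store with release ordering, which a compiler for Alpha would typically implement with a memory barrier (MB) before the store. The lock type and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int held; } ll_sc_lock;  /* 0 = free, 1 = held */

/* Acquire: retry an LL/SC-style conditional update from 0 to 1. */
static void lock_acquire(ll_sc_lock *l)
{
    for (;;) {
        int expected = 0;
        /* Succeeds only if no other thread claimed the lock in between;
         * may also fail spuriously, like a failed stl_c. */
        if (atomic_compare_exchange_weak_explicit(&l->held, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return;
        /* Spin on plain loads (local cache) until the lock looks free. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
            ;
    }
}

/* Single attempt without spurious failure, useful for testing. */
static bool lock_try_acquire(ll_sc_lock *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(&l->held, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/* Release: an ordinary store with release ordering. */
static void lock_release(ll_sc_lock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Spinning on plain loads between attempts keeps a waiting thread in its local cache, echoing the cache-local spinning mentioned earlier in this citing paper, rather than repeatedly issuing conditional stores.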
R.L. Sites and R.T. Witek. Alpha AXP Architecture Reference Manual, Second Edition. Digital Press, 1995.
....optimizations typically allow the spinning to take place in the local cache to reduce bus traffic [3, 15]. More recently, lock-free synchronization has been widely studied [9, 11] and is included in modern instruction sets, e.g., the DEC Alpha's load-locked (ldl_l) and store-conditional (stl_c) [18] (collectively, LL/SC). Rather than achieve mutual exclusion by preventing multiple threads from entering the critical section, lock-free synchronization prevents more than one thread from successfully writing data and exiting the critical section. A store-conditional only completes successfully ....
....with blocking acquires using the lock box mechanism. • SMT LL/SC. This scheme uses the lock-free synchronization currently supported by the Alpha. To implement the ordered access in the benchmark, the acquire primitive is implemented with load-locked and store-conditional, as given in [18], and the release is a store instruction. • SMP. These each use the same primitives as SMT block, but force the synchronization (and data sharing) to occur at different levels in the memory hierarchy (i.e., the L2 cache, the L3 cache, or memory). This mimics the performance of systems with ....
R.L. Sites and R.T. Witek. Alpha AXP Architecture Reference Manual, Second Edition. Digital Press, 1995.
SITES, R. AND WITEK, R. 1995. Alpha AXP Architecture Reference Manual. Digital Press, Newton, Mass.
....[Figure 5: Data Collection System Overview, showing the kernel, the modified /sbin/loader, loadmap info, buffered samples, the user-space daemon, and the on-disk profile database.] When a performance counter overflows, it generates a high-priority interrupt that delivers the PC of the next instruction to be executed [21, 8] and the identity of the overflowing counter. When the device driver handles this interrupt, it records the process identifier (PID) of the interrupted process, the PC delivered by the interrupt, and the event type that caused the interrupt. Our system's default configuration monitors CYCLES and ....
....than CYCLES and IMISS are helpful in tracking down performance problems, but less useful for detailed analysis. 4.1.3 Blind Spots: Deferred Interrupts. Performance counter interrupts execute at the highest kernel priority level (spldevrt) but are deferred while running non-interruptible PALcode [21] or system code at the highest priority level.² Events in PALcode and high-priority interrupt code are still counted, but samples for those events will be associated with .... [Footnote 2: This makes profiling the performance counter interrupt handler difficult. We have implemented a meta method for obtain ....]
R. Sites and R. Witek. Alpha AXP Architecture Reference Manual. Digital Press, Newton, MA, 1995.
....interesting events, including processor clock cycles (CYCLES), instruction cache misses (IMISS), data cache misses (DMISS), and branch mispredictions (BRANCHMP). When a performance counter overflows, it generates a high-priority interrupt that delivers the PC of the next instruction to be executed [21, 8] and the identity of the overflowing counter. When the device driver handles this interrupt, it records the process identifier (PID) of the interrupted process, the PC delivered by the interrupt, and the event type that caused the interrupt. [Figure: per-CPU hash tables and overflow buffers for CPU 0 through CPU n.] ....
....than CYCLES and IMISS are helpful in tracking down performance problems, but less useful for detailed analysis. 4.1.3 Blind Spots: Deferred Interrupts. Performance counter interrupts execute at the highest kernel priority level (spldevrt) but are deferred while running non-interruptible PALcode [21] or system code at the highest priority level.² Events in PALcode and high-priority interrupt code are still counted, but samples for those events will be associated with the instruction that runs after the PALcode finishes or the interrupt level drops below spldevrt. For synchronous PAL ....
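The sample path in these passages (a counter-overflow interrupt, then the driver recording PID, PC, and event type into per-CPU hash tables with overflow buffers) can be sketched in C. This is a hypothetical illustration, not the cited system's implementation: the table size, hash function, and direct-mapped placement with eviction to the overflow buffer are our assumptions, and all names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 256    /* assumed per-CPU table size */
#define OVERFLOW_CAP 1024  /* assumed overflow buffer capacity */

typedef struct {
    uint32_t pid;
    uint64_t pc;
    uint16_t event;
    uint32_t count;        /* 0 means the slot is empty */
} sample_entry;

typedef struct {
    sample_entry table[TABLE_SLOTS];
    sample_entry overflow[OVERFLOW_CAP];
    size_t overflow_len;
    size_t dropped;        /* evictions lost because the buffer was full */
} cpu_sample_buffer;

/* Hypothetical hash of the (PID, PC, event) key. */
static size_t slot_for(uint32_t pid, uint64_t pc, uint16_t event)
{
    uint64_t h = pc ^ ((uint64_t)pid << 17) ^ ((uint64_t)event << 7);
    h ^= h >> 29;
    return (size_t)(h % TABLE_SLOTS);
}

/* Called once per counter-overflow interrupt on this CPU. */
static void record_sample(cpu_sample_buffer *b, uint32_t pid, uint64_t pc,
                          uint16_t event)
{
    sample_entry *e = &b->table[slot_for(pid, pc, event)];
    if (e->count != 0 && e->pid == pid && e->pc == pc && e->event == event) {
        e->count++;                        /* same key: aggregate in place */
        return;
    }
    if (e->count != 0) {                   /* different key: evict old entry */
        if (b->overflow_len < OVERFLOW_CAP)
            b->overflow[b->overflow_len++] = *e;
        else
            b->dropped++;
    }
    e->pid = pid;
    e->pc = pc;
    e->event = event;
    e->count = 1;
}
```

Aggregating repeated (PID, PC, event) triples in place keeps the interrupt handler's work small; only evicted entries need to be copied out for the user-space daemon to merge into the profile database.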
R. Sites and R. Witek. Alpha AXP Architecture Reference Manual. Digital Press, Newton, MA, 1995.
....from scaling up on the small multiprocessor, and that the cache coherence protocol employed by the machine introduced more cache interference than necessary. 1 Introduction. This work was triggered by two performance puzzles (circa 1995) related to Microsoft SQL Server running on Alpha [SW95] PCs under the Windows NT operating system: how could we speed up the uniprocessor version, and how could we get closer to linear scaling for the multiprocessor version? To answer these questions, we found that we needed to look at the detailed behavior of the system under load. We created a tool ....
Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, Newton, MA, 2nd edition, 1995.
Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, 1995. ISBN 1-55558-145-5.
Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, 1995.
R. L. Sites and R. T. Witek. Alpha AXP Architecture Reference Manual. Butterworth-Heinemann, 1995.
R. L. Sites and R. T. Witek. Alpha AXP Architecture Reference Manual, Digital Press, Newton, Massachusetts, second edition, 1995.
CiteSeer.IST - Copyright Penn State and NEC