Intel Corp. Pentium Pro Family Developer's Manual, 1996. Volume 1: Specifications.
....these locks demonstrate 100 a simple silent store pair pattern (Section 3.6.2). By doing so, SLE does not rely on the knowledge of locks and can use silent store pair predictors. The simplicity and portability of test&test&set locks make them quite popular. Hardware architecture manuals recommend [28, 31, 54, 73] and database vendors are advised [83] to use these simple locks as portable locking mechanisms. The POSIX threads standard recommends synchronization be implemented in library calls such as pthread_mutex_lock(), and these calls implement the test&set or test&test&set locks. While test&set based ....
Intel Corporation. Pentium Pro Family Developer's Manual, Volume 3: Operating System Writer's Manual, January 1996.
....rather frequently). The prototype is based on the LINUX proc file mechanism to write performance data. However, the information available in the proc file is significantly extended. The performance modeling framework uses the hardware performance counters of the Pentium processor [10] that are made accessible through a library we have implemented ourselves. All information for performance analysis is gathered at every node of our distributed system in parallel and the performance modeling framework puts it all together into a global view of the parallel system. The monitoring ....
Intel Corporation. Pentium Pro Family Developer's Manual, 1996.
.... to account for possible clock skews, and using microtime() generates the overhead of the system call (approximately 450 nanoseconds on a PentiumPro 200 MHz [10]). A more efficient solution is to directly read the timestamp counter (TSC) register available in the Pentium series processors [14], and compatible architectures such as recent AMD processors (e.g. Athlon). This register is an unsigned 64-bit precision integer, and gives the number of cycles elapsed since the machine has been turned on. Thus, the resolution of the counter is much finer than that provided by microtime(). A ....
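The timestamp-counter approach described in the excerpt above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not code from the citing paper: `read_cycles` and `tsc_elapsed_demo` are hypothetical names, the `__rdtsc()` intrinsic from `<x86intrin.h>` is assumed to be available (GCC or Clang on x86), and a `clock_gettime` fallback is assumed for other architectures.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   /* provides __rdtsc() on GCC and Clang */
#endif

/* Hypothetical helper: return a 64-bit tick count.
 * On x86 this reads the timestamp counter directly, avoiding a system call.
 * Elsewhere it falls back to a monotonic clock in nanoseconds (assumption). */
static uint64_t read_cycles(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical demo: measure the ticks spent in a small busy loop. */
uint64_t tsc_elapsed_demo(void)
{
    volatile unsigned sink = 0;          /* volatile keeps the loop from being optimized away */
    uint64_t start = read_cycles();
    for (unsigned i = 0; i < 1000000u; i++) {
        sink += i;
    }
    uint64_t end = read_cycles();
    return end - start;
}
```

Calling `tsc_elapsed_demo()` returns the number of ticks the loop took. On many recent x86 processors the counter advances at a constant rate rather than with the core clock, so converting ticks to time needs a separately calibrated frequency.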
Intel Corporation. Pentium Pro Family Developer's Manual. Volume III: Operating System Writer's Guide. 1995.
....TEST&SET operation. An example implementation of the TEST&TEST&SET sequence is shown in Figure 2. While numerous lock constructs, both hardware and software, have been proposed, the simplicity and portability of TEST&TEST&SET locks make them quite popular. Hardware architecture manuals recommend [8, 10, 33, 18], and database vendors are advised [22] to use these simple locks as a portable locking mechanism (of course, a few other software primitives are also used when circumstances dictate their use). The POSIX threads standard recommends synchronization be implemented in library calls such as ....
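Figure 2 of the citing paper is not reproduced on this page. As a stand-in, the following is a minimal sketch of a test-and-test-and-set spin lock in C11; it is not the paper's figure. All names (`ttas_lock`, `ttas_acquire`, `ttas_release`, `ttas_demo`) are hypothetical, and C11 `<stdatomic.h>` plus POSIX threads are assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: false = free, true = held. */
typedef struct { atomic_bool held; } ttas_lock;

static void ttas_acquire(ttas_lock *l)
{
    for (;;) {
        /* "Test": spin on ordinary loads while the lock looks held. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed)) {
            /* busy-wait */
        }
        /* "Test-and-set": one atomic exchange; succeeds if the old value was free. */
        if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire)) {
            return;
        }
    }
}

static void ttas_release(ttas_lock *l)
{
    atomic_store_explicit(&l->held, false, memory_order_release);
}

enum { THREADS = 4, ITERS = 10000 };

static ttas_lock demo_lock;   /* static storage: starts out free */
static long demo_counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        ttas_acquire(&demo_lock);
        demo_counter++;        /* protected by the lock */
        ttas_release(&demo_lock);
    }
    return NULL;
}

/* Hypothetical demo: THREADS workers each increment a shared counter ITERS times. */
long ttas_demo(void)
{
    pthread_t t[THREADS];
    demo_counter = 0;
    for (int i = 0; i < THREADS; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < THREADS; i++) {
        pthread_join(t[i], NULL);
    }
    return demo_counter;
}
```

The inner read-only loop is what separates this from a plain test-and-set lock: waiters spin on loads and attempt the atomic exchange only once the lock appears free.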
Intel Corporation. Pentium Pro Family Developer's Manual, Volume 3: Operating System Writer's Manual, January 1996.
....SC is straightforward, because the existing mechanisms are already sufficient. In the next section, we will explore the ramifications of adding value prediction to systems that exploit relaxed consistency models. 3 Value Prediction Relaxed Memory Models Many common instruction set architectures [18, 19, 20, 30, 33] do not require the strict semantics of sequential consistency. These systems are said to implement relaxed memory consistency models. Relaxed memory models allow the hardware to potentially employ optimizations such as store queues and write buffers, and they can simplify the implementation of ....
....store buffers by relaxing the order from a thread's write to its subsequent reads. The other class, generally referred to as weakly ordered models, allows much more reordering of reads and writes. 3.1 Processor Consistency PC models, such as SPARC Total Store Order (TSO) [33] and IA-32 [19], allow relaxation of the order from a thread's write to its subsequent reads. Since PC models do not allow relaxation of read-to-read program order, simple implementations must, in our example, execute r1 and r2 in program order. If, on the other hand, a more sophisticated implementation allows ....
Intel Corporation. Pentium Pro Family Developer's Manual, Volume 3: Operating System Writer's Manual, Jan. 1996.
....will increase in future microprocessors. Proposed solutions to this growing TLB performance bottleneck range from changing the TLB structure to retain more of the working set (e.g. multi-level TLB hierarchies [1, 16]) to implementing better management policies (in software [21] or hardware [20]) to masking TLB miss latency by prefetching entries (again, in software [4] or hardware [41]). All of these approaches can be improved by exploiting superpages. Most commercial TLBs support superpages, and have for several years [30, 43], but more research is needed into how best to make ....
Intel Corporation. Pentium Pro Family Developer's Manual, Jan. 1996.
....over a switched network. All experiments were run on Linux 2.2.12. 2.3. Characterizing Performance After casting our communication systems into the LogGP model, we break down the LogGP parameters into their architectural costs. Our approach is to use the processor's hardware event counters [24] to charge various hardware events to each parameter of the LogGP model. Specifically, we measure the following events: The number of instructions decoded. [Figure: LogGP model with processor (P) and memory (M) pairs on an interconnection network; L (latency), g (gap), limited capacity (L/g to or from a processor), o ....]
Intel Corporation, Santa Clara, CA. Pentium Pro family developer's manual, volume 3: Operating system writer's manual, 1996. Order number 242692.
....RDMSR and WRMSR instructions to program and reset the counters. Therefore, it is necessary to write some code that runs in kernel mode. For maximum flexibility and ease of installation, we wrote a loadable device driver module rather than modifying the booted kernel. See section 10.6.1 of [8] for PerfEvtSelx programming details and appendix A of [7] for the description of the events available for counting. 1 after a kernel mode instruction sets a model-specific register bit. 5 3.3 Optimized Instrumentation In this section we will describe a few steps we have taken in order to ....
Intel Corporation. Pentium pro family developer's manual, order number 242692.
....driver over a switched network. All experiments were run on Linux 2.2.12. 2.3 Characterizing Performance After casting our communication systems into the LogGP model, we break down the LogGP parameters into their architectural costs. Our approach is to use the processor's hardware event counters [24] to charge various hardware events to each parameter of the LogGP model. Specifically, we measure the following events: # The number of instructions decoded. # The number of external bus accesses to memory space 1 . 1 The memory bus is only critical when the cache is not large enough to hold ....
Intel Corporation, Santa Clara, CA. Pentium Pro family developer's manual, volume 3: Operating system writer's manual, 1996. Order number 242692.
....about 400,000 64-byte packets per second even under loads much higher than those we tested. 8.4 cpu time breakdown Table 8.1 breaks down the CPU time cost of forwarding a packet through the baseline Click IP router of Figure 5.1. Costs were measured in nanoseconds by Pentium III cycle counters [23]. Each measurement is the accumulated cost for all packets in a 10-second run divided by the number of packets forwarded. These measurements are larger than the true values as using Pentium III cycle counters has significant cost. Most of the tasks performed by a Click router's CPU are included in ....
Intel Corporation. Pentium Pro Family Developer's Manual, Volume 3, 1996. http://developer.intel.com/design/pro/manuals.
....only after all previous instructions have been retired, and all of the instruction's constituent µops have completed. The Pentium Pro retires up to three µops per clock cycle, yielding a theoretical minimum cycles per µop (CPI) of 0.33. More information on the Pentium Pro can be found in [6] [11] [13] [21] [34]. Measurements were performed using the Pentium Pro hardware counters [13]. We present aggregate (user + operating system) activity, factoring out the idle loop. On the uniprocessor, this technique is possible because NT implements the idle loop using the HALT instruction. The event ....
....constituent µops have completed. The Pentium Pro retires up to three µops per clock cycle, yielding a theoretical minimum cycles per µop (CPI) of 0.33. More information on the Pentium Pro can be found in [6] [11] [13] [21] [34]. Measurements were performed using the Pentium Pro hardware counters [13]. We present aggregate (user + operating system) activity, factoring out the idle loop. On the uniprocessor, this technique is possible because NT implements the idle loop using the HALT instruction. The event counters are inactive during this idle loop, ensuring that we can reliably separate system ....
Intel Corporation. Pentium Pro family developer's manual, volume 3: Operating system writer's manual. Intel Corporation, 1996, Order number 242692.
....was measured for both Chapter 3 and Chapter 4. The I/O subsystems of these configurations are shown in Figure 2-2 through Figure 2-4. 18 sors active. Finally, the number of outstanding bus transactions was varied by changing a BIOS parameter to limit the I/O queue depth of the controller [50]. Ideally, we would like to limit each processor to a single outstanding bus transaction, to explore the effects of the non-blocking L2 cache. The BIOS only allows us, however, to limit the overall system (in other words, all four processors) to a single outstanding bus transaction. Thus, we ....
....The Pentium Pro retires up to three µops per clock cycle, yielding a theoretical minimum cycles per µop (CPI) of 0.33. Table 2-5 summarizes the characteristics of the Pentium Pro caches. More detailed descriptions of the Pentium Pro's architectural features can be found in [15] [25] [39] [50] [76]. We will also present additional details in subsequent sections, when discussing our measurement results. 2.5.2. Potential Sources of Pentium Pro Stalls In practice, the 0.33 theoretical minimum CPI is seldom achieved, due to stalls from cache misses, oversubscription of certain resources, ....
[Article contains additional citation context not shown here]
Intel Corporation. Pentium Pro family developer's manual, volume 3: Operating system writer's manual. Intel Corporation, 1996, Order number 242692.
....traps will increase in future microprocessors. Proposed solutions to this growing TLB performance bottleneck range from changing the TLB structure to retain more of the working set (e.g. multi-level TLB hierarchies [1, 8]) to implementing better management policies (in software [10] or hardware [9]) to masking TLB miss latency by prefetching entries (again, in software [2] or hardware [25]). All of these approaches can be improved by exploiting superpages. Most commercial TLBs support superpages, and have for several years [16, 28], but more research is needed into how best to make ....
Intel Corporation. Pentium Pro Family Developer's Manual, Jan. 1996.
.... explicitly or not (e.g. programmer-centric models [16, 5, 19]). It is the task of the compiler to ensure that the semantics of a high-level program is preserved when its compiled version is executed on an architecture with a certain low-level memory model (e.g. architecture-centric models [25, 18, 26, 14]). The essence of any memory model is the correspondence between each load instruction and the store instruction that supplies the value retrieved by the load. Unfortunately, at the architecture level, memory access operations often have some sophisticated implementation characteristics that make it ....
Intel, editor. Pentium Pro Family Developer's Manual, Volume 3: Operating System Writer's Manual. Intel Corporation, 1996.
....control packets and data packets can be pipelined and use separate busses. RDRAM address remapping [4] was modeled to reduce the rate of bank interference. The peak bandwidth that can be reached in our RDRAM model is 1.6 GB/sec. A simplified uncacheable write combining (or write coalescing) memory [2][3] was implemented as well for the purpose of correctly simulating our benchmark behavior. Whenever a data write to an uncacheable region results in an L1 cache miss, the write operation will immediately request access to the bus and drive data out to the system memory directly (skipping a ....
Intel Corporation. Pentium Pro Family Developer's Manual, volume 3: Operating System Writer's Manual. Intel Literature Centers, 1996.
....running the tulip driver. All experiments were run on Linux 2.2.12. 2.3 Characterizing Performance After casting our communication systems into the LogGP model, we break down the LogGP parameters into their architectural costs. Our approach is to use the processor's hardware event counters [24] to charge various hardware events to each parameter of the LogGP model. Specifically, we measure the following events: 1 We expect to get machines with 750 MHz AMD Athlon CPUs and 133 MHz system bus soon so will have measurements for 5 clock speeds for the final paper. 5 The number of ....
Intel Corporation. Pentium Pro family developer's manual, volume 3: Operating system writer's manual. Santa Clara, CA, 1996. Order number 242692.
....multiprocessor systems, using unmodified binary programs. To measure memory behavior, we wrote a memory system model that simulates a two level cache hierarchy and a cycle accurate multiprocessor split transaction bus. The bus protocols in our memory model are based on the Pentium II MESI protocol [7] and are tuned for characteristics of processors a few years in the future. Simics sends each memory request to our memory model, which analyzes the effects of the requests and sends the timing information back to Simics. To prevent our results from being skewed, the memory model detects and ....
Intel Corporation. Pentium Pro Family Developer's Manual, Volume 1: Specification, 1996.
....which the processor maps logical registers into physical locations. Register renaming is used to remove register anti-dependencies and output dependencies and to recover from control speculation. The basic register renaming mechanism is well known and widely used (e.g. Intel Pentium Pro Processor [Inte96]). This section presents the most advanced combined register renaming and dependency-tracking scheme involving three structures: a Free List (FL), a Register Alias Table (RAT), and an Active List (AL). This scheme has been used in the MIPS R10000 and DEC 21264. The RAT maintains the latest ....
Intel Corporation, Pentium Pro Family Developer's Manual. Volume 2: Programmer's Reference Manual, 1996.
No context found.
Intel Corp. Pentium Pro Family Developer's Manual, 1996. Volume 1: Specifications.
No context found.
Intel Corporation. Pentium Pro family developer's manual, volume 3: Operating system writer's manual. Intel Corporation, 1996, Order number 242692.
No context found.
Intel Corporation. Pentium Pro Family Developer's Manual, 1996.
No context found.
Intel Corporation. Pentium Pro Family Developer's Manual. Volume III: Operating System Writer's Guide, 1995.
No context found.
Intel Corporation. Pentium Pro Family Developer's Manual. Palo Alto, CA USA, Jan. 1996.
No context found.
Intel Corporation, editor. Pentium Pro Family Developer's Manual, chapter 7.4.15. Intel, December 1995.
No context found.
Intel Corporation. Pentium Pro Family Developer's Manual, Volume 3: Operating System Writer's Manual, January 1996.