T. Mathisen. Pentium Secrets. Byte, pages 191--192, July 1994.
....3.3) the resulting decay interval is on average 3.5 times the period of the global counter. In our power evaluations, we assume that the global counter will come for free, since many processors already contain various cycle counters for the operating system or for performance counting [14, 50, 81]. If such counters are not available, a simple N-bit binary ripple counter could be built with 40N + 20 transistors, of which few would transition each cycle. To minimize state transitions in the local 2-bit cache-line counters, and thus minimize dynamic power consumption, we use Gray coding so ....
T. Mathisen. Pentium Secrets. Byte, pages 191--192, July 1994.
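The Gray-coding remark in the excerpt above can be illustrated with a minimal sketch. The function names below are hypothetical and not taken from the cited paper; the sketch only shows that a 2-bit Gray-coded counter changes exactly one bit per step, whereas a plain binary counter sometimes changes two.

```c
#include <stdint.h>

/* Convert a binary count to its reflected Gray code. */
static uint8_t gray_encode(uint8_t binary)
{
    return (uint8_t)(binary ^ (binary >> 1));
}

/* Advance a 2-bit Gray-coded counter: 00 -> 01 -> 11 -> 10 -> 00.
 * The table is indexed by the current Gray value. */
static uint8_t gray2_next(uint8_t gray)
{
    static const uint8_t next[4] = { 0x1, 0x3, 0x0, 0x2 };
    return next[gray & 0x3];
}

/* Count how many of the two low bits differ between two states. */
static unsigned bits_flipped(uint8_t a, uint8_t b)
{
    uint8_t x = (uint8_t)((a ^ b) & 0x3);
    return (unsigned)(x & 1u) + (unsigned)((x >> 1) & 1u);
}
```

Every call to `gray2_next` flips a single bit, while the binary step from 01 to 10 flips two; fewer bit flips per step is the property the excerpt relies on to reduce dynamic power.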
....events. Observable events typically include cache misses/hits, bus transactions, branch mispredictions, and instruction retirement. Such event counters are used in commercial performance monitoring toolsets [9] and are also employed in more ad hoc ways to track subtle performance bugs [12]. The key advantage of hardware performance counters is that instead of relying on simplified performance simulations, programmers can evaluate the impact of their final optimizations on real hardware. The capabilities provided by event counter mechanisms vary across processor families and ....
T. Mathisen. Pentium secrets. Byte Magazine, pages 191-192, July 1994.
....P6 family, although the magnitude of the tradeoffs will vary. 2.2 Performance Measurements. In order to perform detailed performance/power tradeoff studies, we need extensive performance information about the programs being run. Towards that end we employed the CPU's hardware performance counters [9]. These performance debugging aids are nearly ubiquitous in modern microprocessors. They can typically be used to tabulate important processor-level events like cache hits/misses, branch mispredictions, or instruction retirement. In our case, the specific counters that were the most useful were: ....
T. Mathisen. Pentium secrets. Byte Magazine, pages 191-192, July 1994.
....except for the first and last iteration, the branch will be correctly predicted, and will execute in (at most) a single clock cycle. The fourth item will only concern us as far as it allows for faster cache line fills of the Pentium's two internal caches. The last, largely undocumented feature [Mat94] will allow us to monitor the extent of our improvements. The Pentium processor can execute two instructions in parallel through two five-stage pipelines, called the U-pipe and the V-pipe. The processor always issues the first instruction of the pair to the U-pipe. The second instruction of ....
T. Mathisen, "Pentium Secrets," Byte, Vol. 19, No. 7, July 1994, pp. 191-- 192.
....and it also has four breakpoint registers for establishing breakpoints. Although some of these features are not documented, and are only available through a nondisclosure agreement with Intel, Pentium debugging and performance monitoring features have been reverse-engineered and published in [28]. The Alpha AXP architecture [29] is a 64-bit load/store RISC architecture designed with particular emphasis on clock speed, multiple instruction issue and software migration. Although its debugging facilities are reduced, it includes performance monitoring features like several registers to ....
T.Mathisen, "Pentium Secrets", BYTE magazine, pp. 191-192, July, 1994.
....with each path. Digital's Continuous Profiling [3] combines hardware measurement with operating system support to measure time spent both in application programs and in executing operating system services. Hardware measurement facilities are included with many microprocessors today [1], [17], [28]. All of these sequential techniques complement Critical Path Profiling. In most cases, they can be used after Critical Path has isolated a performance problem to a specific procedure. 7 CONCLUSION We have presented an online algorithm to compute the critical path profile of a parallel ....
T. Mathisen, "Pentium Secrets," Byte, vol. 19, no. 7, pp. 191-192, 1994.
....less flexible histogramming, and could not categorize statistics based on data regions or interrupt the processor based on a user-set threshold. On-chip performance monitors are becoming more common for CPU chips. For example, Intel's Pentium line of CPUs incorporates extensive on-chip monitoring [17]. The Pentium performance counters include information on the number of reads and writes, the number of read misses and write misses, pipeline stalls, TLB misses, etc. Here also, there is no support for categorization of statistics or for selective CPU notification. In contrast, the Alpha 21064 ....
T. Mathisen. Pentium Secrets. Byte, pages 191--192, July 1994.
....Intel released the Pentium processor, they merely mentioned that there were several performance monitoring counters in the Pentium, but did not disclose how to access them or what they measured. However, Terje Mathisen reverse-engineered this information and published the results in BYTE Magazine [1]. Following this, Intel released the Pentium's performance monitoring hardware information. They also disclosed this information for both the Pentium Pro and the Pentium with MMX processors when they were released. 4 Two papers have been published concerning the use of the Pentium's performance ....
.... int buf[3]; /* 24 byte reading buffer */ char outbuf[5]; /* writing buffer */ double temp1, temp2; /* holds the results from the buffer */ /* Form the buffer to send to the device. Since clearing the counters, only need a 5 byte buffer */ outbuf[0] = 0x83; /* Event 0 is Data Read Miss, */ outbuf[1] = 0; /* and count user level events */ outbuf[2] = 0x80; /* Event 1 is Data Read, */ outbuf[3] = 0; /* and count user level events */ outbuf[4] = 3; /* want to clear both counters */ /* Open the device using the open system call */ P5MON = open("/dev/p5mon", O_RDWR); if (P5MON < 0) printf("opening ....
T. Mathisen. "Pentium Secrets," BYTE Magazine, pp. 191-192, July 1994.
....of our real-time communication mechanism. The measurements are done on two IBM PC/AT compatible machines. Each machine has a 166 MHz Intel Pentium processor, 32 megabytes of memory, and a 3Com EtherLink III (ISA, 3c509) Ethernet interface. We used the RDTSC (read time stamp counter) instruction [9] on the Pentium processor for measurements. To show the effect of the protocol processing mechanism in the different implementations, we have measured four implementations. The first one is the in-kernel protocol processing, and we used FreeBSD 2.2.1-RELEASE. It is represented as FreeBSD in the ....
T. Mathisen. Pentium secrets. Byte magazine. http://www.byte.com/art/9407/sec12/art3.htm.
....execution times to compare thread scheduling overheads of these operating systems. Our experimental machine was a PC with an Intel 100 MHz Pentium processor and 256K bytes of secondary cache. We made our measurements using a 64-bit cycle counter implemented in the Intel Pentium processor [9]. 5.1 Upcall Performance. Since the Arx kernel relies heavily on upcalls, it is extremely important to minimize the upcall latency. Figure 8 shows the typical upcall performance in the Arx kernel with a typical user-level scheduler implementing fixed-priority scheduling with the round-robin ....
T. Mathisen. Pentium secrets. Byte, pages 191--192, July 1994.
....the method cannot be directly used to observe the system's dynamics or task interactions in a multitasking environment. A number of recent microprocessors have internal registers that indicate the number of elapsed clock cycles since system startup. An example is Intel's Pentium processor [9]. Timing measurement using such a register can provide exact, high-precision timing information about the system operation. However, such a method has two limitations when used to instrument real-time software. First, the method is entirely dependent on the target processor, and second, ....
T. Mathisen. Pentium secrets. Byte, pages 191--192, Jul. 1994.
....for gaining deeper insight into application performance and for pinpointing performance bottlenecks. PMCs were first used extensively on Cray vector processors, and appear in some form in all modern microprocessors, such as the MIPS R10000 (see [5], [13], [14], [17]), Intel Pentium (see [7], [8], [12]), IBM PowerPC (see [16]), DEC Alpha (see [2]), and HP PA-8000 (see [6]). Most of the microprocessor vendors provide hardware developers and selected performance analysts with documentation on counters and counter-based performance tools. Useful information regarding PMCs can be found on the ....
....direct calls to the libperfex routines cannot be mixed. Chapter 6 Event Counting on Pentium Processors. The Pentium processor family provides two 40-bit PMCs, making it possible to monitor two types of events simultaneously. These counters can either count events or measure duration (see [7], [8], [12]). When counting events, a counter is incremented each time a specified event takes place or a specified number of events takes place. When measuring duration, a counter counts the number of processor clock cycles that occur while a specified condition is true. The counters can count events or ....
T. Mathisen, Pentium Secrets, Byte Magazine, July 1994, pp. 191--192.
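Because the excerpt above describes the Pentium PMCs as 40 bits wide, a tool that samples a counter twice has to allow for the value wrapping past 2^40. The following is a minimal sketch of that arithmetic only; the names `PMC_BITS`, `PMC_MASK`, and `pmc40_delta` are hypothetical, and the sketch assumes at most one wraparound between the two samples. It does not show how the counters are read, which is processor- and OS-specific.

```c
#include <stdint.h>

/* Assumed counter width, taken from the excerpt (two 40-bit PMCs). */
#define PMC_BITS 40
#define PMC_MASK ((UINT64_C(1) << PMC_BITS) - 1)

/* Events (or cycles) counted between two samples of a 40-bit counter.
 * Unsigned subtraction followed by masking yields the correct difference
 * even if the counter wrapped once between the two reads. */
static uint64_t pmc40_delta(uint64_t start, uint64_t end)
{
    return (end - start) & PMC_MASK;
}
```

For example, a start sample of `0xFFFFFFFFFF` (the largest 40-bit value) followed by an end sample of `5` corresponds to six counted events.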
....aggregate execution times to compare thread scheduling overheads of these operating systems. Our experimental machine was a PC with an Intel 100 MHz Pentium processor and 256K bytes of secondary cache. We made our measurements using a 64-bit cycle counter implemented in the Intel Pentium processor [9]. 5.1 Basic Performance. In this section, we compare the performance of ARX with QNX version 4.23A. QNX is a commercial real-time operating system which is widely used. Table 5 shows the basic performance numbers of ARX and QNX. The typical scheduling latency in ARX is 8.09 µs which is much ....
T. Mathisen. Pentium secrets. Byte, pages 191--192, July 1994.
....offered less flexible histogramming, and could not categorize statistics based on data regions or interrupt the processor based on a user-set threshold. On-chip performance monitors are becoming more common for CPU chips. For example, Intel's Pentium CPU incorporates extensive on-chip monitoring [18]. The Pentium performance counters include information on the number of reads and writes, the number of read misses and write misses, pipeline stalls, TLB misses, etc. Here also, there is no support for categorization of statistics or for selective CPU notification. In contrast, the Alpha 21064 ....
T. Mathisen. Pentium secrets. Byte, pages 191-- 192, July 1994.
....well-configured cluster, and thus could gauge worthwhile hardware purchases for the entire system. Such models could be easily constructed from a more comprehensive set of hardware counters at all levels of the machine. Though most modern processors have a reasonable set of performance counters [12, 22] that have been shown to be useful for detailed performance profiling [27], other components of the machine are ignored. For example, researchers have shown that network packet counters can be extremely useful [8, 20]. However, just monitoring incoming and outgoing packets is not enough. ....
T. Mathisen. Pentium Secrets. Byte, pages 191--192, July 1994.
....not work with general time-sharing applications, since they cannot affect real-time processes. Our experimental machine was a PC with an Intel 133 MHz Pentium processor and 32M bytes of main memory. We made our measurements using a 64-bit cycle counter implemented in the Intel Pentium processor [11]. Initialized at bring-up time, this cycle counter increments every machine cycle. On our experimental machine, 133,000 cycles equal 1 ms. We set the CPU time quantum to 2 ms and the interval timer to 1 ms. During the experiments, our objectives were to (1) carefully measure the upcall ....
T. Mathisen. Pentium secrets. Byte, pages 191--192, July 1994.
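The conversion stated in the excerpt above (133,000 cycles equal 1 ms on a 133 MHz processor) follows from dividing a cycle count by the clock frequency. A minimal sketch of that arithmetic is given below; the function names are hypothetical, the clock rate is passed in by the caller rather than detected, and the sketch assumes the counter advances once per clock cycle at a fixed frequency, as the excerpt describes.

```c
#include <stdint.h>

/* Cycles elapsed between two samples of a free-running 64-bit counter. */
static uint64_t elapsed_cycles(uint64_t start, uint64_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds.
 * A clock of f MHz ticks f times per microsecond, so us = cycles / f. */
static double cycles_to_us(uint64_t cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}
```

With `clock_mhz = 133.0`, a difference of 133,000 cycles converts to 1000 microseconds, matching the 1 ms figure quoted in the excerpt, and the 2 ms time quantum corresponds to 266,000 cycles.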