| D. Bhandarkar and J. Ding. Performance characterization of the Pentium Pro processor. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages 288--297, Feb. 1997. |
.... latencies such as those between the on chip cache (L1) and a closely integrated second level cache (L2). However, in a performance study of the Pentium Pro, Bhandarkar and Ding specifically point out that servicing L2 misses was a significant source for the underutilization of machine resources [Bhandarkar 97]. Finally, some latency hiding techniques can use both hardware and software support: this is the case for data prefetching. 1.3 Thesis Contributions Most research to date has concentrated on the L1 memory or on the L1-L2 interface. Techniques for reducing the cost of accessing main memory, ....
D. Bhandarkar and J. Ding. Performance Characterization of the Pentium Pro Processor. In 3rd International Symposium on High-Performance Computer Architecture, pages 288--297, February 1997.
....on different platforms to understand how well the application scales with processor advancement. Speech input is captured in files and run in batch mode in order to control the quality of the samples. We then use an Intel toolset called VTune to collect built in performance counter values [22][23]. Finally, we use a full system simulator called SoftSDV together with a cache simulator to study the memory behavior in more depth. 3.1 Real time performance on different We use the indicator xRT (real time ratio) to measure the speed of the speech engine. An xRT of 1 means the time to ....
....than instruction fetch stalls and partial stalls in LVCSR. This is similar to floating point benchmarks. The resource related stalls indicate that there are instructions in the application requiring the same hardware resources such as register renaming buffer entries and memory buffer entries [22]. Branch misprediction recovery and delay in retiring mispredicted branches also cause resource-related stalls. Since branch misprediction rate and L2 cache miss rate are not overly high, long dependency chains in the code may have caused these stall cycles. At the same time, there is a large ....
D. Bhandarkar and J. Ding, "Performance Characterization of Pentium Pro Processor," Proc. of Symp. On High Performance Computer Architecture, Feb 1-5, 1997, San Antonio pages
....that can be configured using special instructions to measure hardware events. The events are for instance number of branch instructions executed, number of requests to memory, and number of instructions retired. Using these events the low level behaviour of program code can be characterised [11, 12]. For instance, the ratio between number of branch instructions and mispredicted branch instructions characterises how well a program exploits the branch prediction hardware. 2.2.1 A model for understanding hardware behaviour of database workloads. Another way to use the performance counters is ....
....how well a program exploits the branch prediction hardware. 2.2.1 A model for understanding hardware behaviour of database workloads. Another way to use the performance counters is to break down total execution time into its low level constituents. This is done for the SPEC95 benchmark suite in [12] and more systematically for database management systems in the recent work by Ailamaki [2]. Ailamaki shows that database systems do not scale well to the performance offered by modern CPUs. More than half of the time is spent stalling when executing simple queries. The model she used to obtain ....
Dileep Bhandarkar and Jason Ding. Performance Characterization of the Pentium Pro Processor. In Proceedings of the Third International Symposium on High Performance Computer Architecture, pages 288--297, Feb. 1997.
....Modern servers use a 3 tier approach in which the backend tier handles the database accessing and the front end and the middle tiers implement much of the user interface and portals. Some researchers have studied large database applications, which are usually used as the backend of Internet servers [1, 2, 5, 12]. While these studies have revealed much about the behavior of backend applications, the behavior of the front and middle tiers of server side workloads is still not fully understood. We attempt with this study to fill some of that knowledge gap by characterizing the impact of the front and ....
....the significant OS activity observed in the server execution also, the whole instruction stream of the Apache server is definitely much more complex and larger than Gcc. It might be noted that high instruction cache miss rates have been observed in traditional database server applications also [1, 2]. L1 data cache: The trends of the data cache miss rates in the level one data cache are not very different between the server and the SPECint applications and hence are not shown in detail. The SPECint applications have miss rates in a wide range, with mcf miss rates being very high. The miss ....
D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor", Proceedings of the 3rd High Performance Computer Architecture (HPCA) Symposium, 1997, pp. 288-297.
....measurement) III and POWER3 II, respectively. VolanoMark is run with 1, 10, and 30 chat rooms (indicated as vol01, vol10, and vol30) and SPECjbb is run with 1, 10, and 25 warehouses (indicated as jbb1, jbb10, and jbb25). The metrics collected are similar to those collected by Bhandarkar et al. [3]. Table 1. Java servers vs. SPECint2000 (RS64 III) As the tables indicate, VolanoMark spends a high proportion of its execution cycles in kernel mode (os cyc). This phenomenon is likely due both to the fact that it spends a great deal of time sending and Table 2. Java servers vs. SPECint2000 ....
D. Bhandarkar and J. Ding. Performance Characterization of the Pentium Pro Processor. In Proceedings of the Third International Symposium on High-Performance Computer Architecture, 1997, pp. 288-297.
....and database workloads. Quantitatively, these ILP features are substantially more effective for the image and video benchmarks than for online transaction processing (OLTP) workloads [99] and comparable in benefit to previously reported scientific and decision support system (DSS) workloads [10, 89, 99]. It must be noted that the performance of the in order issue processor is dependent on the quality of the compiler used to schedule the code. Our experiments use the commercial SPARC SC4.2 compiler with maximum optimizations turned on for the in order UltraSPARC processor. To try to isolate ....
Dileep Bhandarkar and Jason Ding. Performance Characterization of the Pentium Pro Processor. In Proceedings of the Third International Symposium on High Performance Computer Architecture, pages 288--297, Feb 1997.
....studied also in earlier work on the paging problem, which used probabilistic analysis (see a survey in [4]). There has been some earlier work on the A PS problem, in the study of pipeline processors: the Eager Execution (EE) algorithm (see Section 5.1) was shown to perform well in practice ([2, 28]); however, no theoretical performance bounds were derived for this algorithm. Uht and Sindagi [28] introduced the Disjoint Eager Execution (DEE) algorithm and presented an empirical study of its performance. Raghavan et al. [22] showed that using Markov decision theory, it is possible to 2 Thus, ....
D. Bhandarkar, J. Ding, "Performance Characterization of the Pentium Pro Processor", Proc. of the 3rd International Symposium on High Performance Computer Architecture, San Antonio, 1997.
....on the RS64 III and POWER3 II, respectively. VolanoMark is run with 1, 10, and 30 chat rooms (indicated as vol01, vol10, and vol30) and SPECjbb is run with 1, 10, and 25 warehouses (indicated as jbb1, jbb10, and jbb25). The metrics collected are similar to those collected by Bhandarkar et al. [3]. Table 1. Java servers vs. SPECint2000 (RS64 III) As the tables indicate, VolanoMark spends a high proportion of its execution cycles in kernel mode (os cyc). This phenomenon is likely due both to the fact that it spends a great deal of time sending and Table 2. Java servers vs. SPECint2000 ....
D. Bhandarkar and J. Ding. Performance Characterization of the Pentium Pro Processor. In Proceedings of the Third International Symposium on High-Performance Computer Architecture, 1997.
....The rest of the chapter is organized as follows. Section 4.1 presents the various execution characteristics of commercial multimedia applications (from Table 3.1) on a Pentium II processor with MMX technology. I compare them with existing SPEC and SYSmark NT characteristics presented in [9]. Section 4.2 presents an evaluation of SIMD and VLIW techniques for media and signal processing using a Pentium II and C62xx as representative processors. Section 4.3 summarizes the chapter. 4.1 Detailed Characterization of Multimedia Applications I use a Pentium II processor with MMX ....
....Figure 4.1. Cycles per instruction (CPI) (a) for individual multimedia benchmarks (QuakeII, Unreal, RealVideo, QuickTime, Winamp, RealAudio) and (b) comparison of media applications with other workloads (SPECint95, Multimedia, SPECfp95, SYSmark NT) [9]. ....execution units are full; but these stalls may be overlapped with the execution latency of previously executing instructions. The increase in CPI is directly proportional to the sum of I stream and resource stalls as observed in Figure 4.2(a). RealAudio has the highest number of Resource ....
D. Bhandarkar and J. Ding, "Performance characterization of the Pentium Pro processor," Proc. High Performance Computer Architecture, pp. 288-297, Feb. 1997.
....architectures, measuring DRAM impact on total application performance, decomposing the memory access time into different components, and measuring the hit rates in the row buffers. Finally, there are many studies that measure system wide performance, including that of the primary memory system [1, 2, 10, 22, 26, 27, 34, 35]. Our results resemble theirs, in that we obtain similar figures for the fraction of time spent in the primary memory system. However, these studies have different goals from ours, in that they are concerned with measuring the effects on total execution time of varying several CPU level parameters ....
....than SPEC: the middle bars in Figure 10(a) for these benchmarks, which represent CPU speeds of 1GHz, have non overlapped DRAM components constituting 10 25 of the total execution time. This echoes published results for DRAM overheads in commercial workloads such as transaction processing [1, 2, 10, 22]. Another obvious point is that anywhere from 5 to 99 of the memory overhead is overlapped with processor execution the most memory intensive applications successfully overlap 5 20 . SimpleScalar schedules instructions extremely aggressively and hides a fair amount of the memory latency ....
D. Bhandarkar and J. Ding. "Performance characterization of the Pentium Pro processor." In Proc. Third International Symposium on High Performance Computer Architecture (HPCA'97), San Antonio TX, February 1997, pp. 288--297.
....counters. Events in nonprivileged user code (user mode) and privileged operating system code (OS mode) can be counted separately. Our lab developed PMON [14] to access these counters. PMON consists of two parts, a device driver and a control program. The driver reads the performance counters [3, 7] of the Pentium III processor while the control program controls the measurement process and logs the results. Since we developed the whole tool ourselves, we have better control over it than any other performance counter tools like Intel's P6Perf. The overhead of PMON is extremely small because ....
.... and tables as volano01, volano10 and volano30) and run SPECjbb2000 with 1, 10 and 25 warehouses (the number of warehouse threads is 1, 10 and 25, shown in figures and tables as jbb01, jbb10 and jbb25). The microarchitectural parameters measured are similar to those measured by Bhandarkar et al. [3]. Table 1 shows the percentage of cycles spent in OS mode. SPECjbb2000 has neither file accesses nor network connections. And since it is a memory resident Java database program, few page faults occur. Therefore, OS cycle time constitutes less than 0.7 of the total execution time, which is not ....
D. Bhandarkar and J. Ding. Performance Characterization of the Pentium Pro Processor. In Proceedings of the Third International Symposium on High-Performance Computer Architecture, 1997.
....high performance scientific applications. They also use source code instrumentation to access hardware performance counters. The classic study [4] compares various overall metrics (CPI, IC, instruction frequencies, etc.) for organizationally similar RISC and CISC processors on the SPEC benchmarks. [5] also uses built in performance counters to compare the performance of Pentium and Pentium Pro processors. [5] uses the performance counters to show that the Pentium Pro achieves significantly lower CPI than the original Pentium. [1] use flow and context sensitivity in data flow analysis for ....
....counters. The classic study [4] compares various overall metrics (CPI, IC, instruction frequencies, etc.) for organizationally similar RISC and CISC processors on the SPEC benchmarks. [5] also uses built in performance counters to compare the performance of Pentium and Pentium Pro processors. [5] uses the performance counters to show that the Pentium Pro achieves significantly lower CPI than the original Pentium. [1] use flow and context sensitivity in data flow analysis for dynamic program analysis and program profiling. Interesting data structures are used to associate hardware metric ....
Dileep Bhandarkar and Jason Ding. Performance characterization of the Pentium Pro processor. In Proceedings of the Third International Symposium on High Performance Computer Architecture, 1997.
....ops have completed. The Pentium Pro retires up to three ops per clock cycle, yielding a theoretical minimum cycles per op (CPI) of 0.33. Table 2 5 summarizes the characteristics of the Pentium Pro caches. More detailed descriptions of the Pentium Pro's architectural features can be found in [15] [25] [39] [50] [76]. We will also present additional details in subsequent sections, when discussing our measurement results. 2.5.2. Potential Sources of Pentium Pro Stalls In practice, the 0.33 theoretical minimum CPI is seldom achieved, due to stalls from cache misses, oversubscription of ....
....types to monitor two main aggregate stall categories: resource stalls and instruction related stalls. Instruction related stalls count the number of cycles that instruction fetch is stalled for any reason, including L1 instruction cache misses, ITLB misses, ITLB faults, and other minor stalls [15] [50]. Resource stalls account for cycles in which the decoder gets ahead of execution. For example, resource stalls encompass the conditions where register renaming buffer entries, reorder buffer entries, memory buffer entries, or execution units are full. In addition, serializing instructions ....
D. Bhandarkar and J. Ding. "Performance characterization of the Pentium Pro processor," Proc. of HPCA-3, February 1997.
.... (capable of retiring up to three microinstructions per cycle). It implements dynamic execution using an out of order, speculative execution engine, with register renaming of integer, floating point and flag variables, carefully controlled memory access reordering, and multiprocessing bus support [25]. Two integer units, two floating point units, and one memory interface unit allow up to five micro ops to be scheduled per clock cycle. In addition, it provides the MMX execution unit. In our analysis, we used a 300 MHz Pentium II with 16 KB of L1 instruction and data caches and 512 KB of L2 ....
D. Bhandarkar and J. Ding, "Performance characterization of the Pentium Pro processor", Proc. of 3rd Int. Sym. on High Performance Computer Architecture, pp. 288-297, Feb. 1997.
....Chen et al., used the Pentium's hardware counters to compare the performance of the Microsoft Windows 3.1, Microsoft Windows 95, and NetBSD operating systems [3] [4]. Intel itself recently published a paper comparing the performance of the Pentium and Pentium Pro running various benchmarks [5]. These comparisons contained information that was gathered from the hardware monitoring counters on both processors. However, no information was given on how they actually gathered the data. Presumably, other processors contain this type of hardware, as monitoring counters are an excellent source ....
....instead of numbers to reference certain key aspects of controlling the PMC configuration. These enumerated types are useful for any application using the p5mon device. int main( int P5MON; holds the p5mon device ID unsigned long long int buf[3] 24 byte reading buffer char outbuf[5]; writing buffer double temp1, temp2; holds the results from the buffer Form the buffer to send to the device Since clearing the counters, only need a 5 byte buffer outbuf[0] 0x83; Event 0 is Data Read Miss, outbuf[1] 0; and count user level events ....
D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor," in Proceedings, Third International Symposium on High-Performance Computer Architecture, 1997, p. 288.
.... architectures and applications [33, 1, 32, 22]. While such analytical models had fallen out of favor, being replaced by comprehensive simulations, they have recently been enjoying a resurgence due to the need to model large scale NUMA machines and the availability of hardware performance counters [18, 7]. However, these models have mainly been used to analyze the performance of various architectures or system level behaviors. That is, they have not been considered as competitive approaches to models such as the BSP. In [2] we propose a cost model that we call F, which is based on values ....
D. Bhandarkar and J. Ding. Performance Characterization of the Pentium Pro Processor. In Proc. of HPCA III, February 1997.
....processors are able to issue up to six instructions per cycle from a single sequential instruction stream [1]. VLSI technology will soon allow future microprocessors to issue eight or more instructions per cycle. However, ILP found in a conventional instruction stream is limited. Recent studies [2, 3, 4] show the limits of processor utilization even of today's superscalar microprocessors reporting IPC values between 0.14 and 1.9. One solution to increase performance is an additional utilization of more coarse grained parallelism either by integrating two or more complete processors on a single ....
Bhandarkar, D., Ding, J.: Performance Characterization of the Pentium Pro Processor. 3rd Int. Symp. on High-Performance Computer Architecture HPCA-3, San Antonio, Feb. 1997.
....(a) and number of resource related stall cycles per stalled instruction for each query (b) from 0.6 (in Q12) to 1.6 (in Q2 and Q17). The average CPI is 1.27. This value is lower than the one found for TPC C [6] (2.52) and is comparable to the values found for the SPEC95 technical benchmarks [2] (0.5 to 1.5). While the CPIs of TPC D are relatively modest, it is still important to examine all the potential sources of processor stall cycles. 3.4 Resource Stalls One type of stall cycles are resource related stalls. These stalls occur when there are not enough register renaming resources, ....
D. Bhandarkar and J. Ding. Performance Characterization of the Pentium Pro Processor. In Proceedings of the Third International Symposium on High-Performance Computer Architecture, pages 288-297, February 1997.
....(nonmultimedia applications). By trying to answer the above questions, we provide insight and analysis based on measurements using built in performance counters of the processor. There have been some characterizations of desktop applications running under the Windows NT operating system [1] [2] [13] [14] [15]. Bhandarkar and Ding [1] characterized the performance of a Pentium Pro processor for both the SPEC benchmarks and the SYSmark NT benchmark suite (contains Word, Excel, Powerpoint, Texim, and MaxEDA). Lee et al. [2] have examined the performance of common desktop applications ....
....By trying to answer the above questions, we provide insight and analysis based on measurements using built in performance counters of the processor. There have been some characterizations of desktop applications running under the Windows NT operating system [1] [2] [13] [14] [15]. Bhandarkar and Ding [1] characterized the performance of a Pentium Pro processor for both the SPEC benchmarks and the SYSmark NT benchmark suite (contains Word, Excel, Powerpoint, Texim, and MaxEDA). Lee et al. [2] have examined the performance of common desktop applications (acroread, Netscape, photoshoppe, Powerpoint, ....
D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor", Proceedings of High Performance Computer Architecture 97, pp. 288-297, Feb 1997.
D. Bhandarkar and J. Ding. Performance characterization of the Pentium Pro processor. In Proceedings of the International Symposium on High-Performance Computer Architecture, pages 288--297, Feb. 1997.
D. Bhandarkar and J. Ding. "Performance characterization of the Pentium Pro processor." In Proc. of HPCA-3, February, 1997.
D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor," Proc. 3rd Int'l Symp. High-Performance Computer Architecture (HPCA-3), IEEE CS Press, Los Alamitos, Calif., 1997, pp. 288-297.
BHA97 D. Bhandarkar, and J. Ding, "Performance Characterization of the Pentium Pro Processor", In Proceedings of the Third International Symposium on High Performance Computer Architecture (HPCA-3), San Antonio, TX, February 1997.
D. Bhandarkar and J. Ding. "Performance characterization of the Pentium Pro processor." In Proc. of HPCA-3, February, 1997.
D. Bhandarkar and J. Ding, "Performance Characterization of the Pentium Pro Processor", Proc. HPCA-3, pp. 288-297, Feb. 1997.