R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....organizations. Later studies were trace-based. Some researchers relied on intrusive instrumentation of the OS and user-level workloads [16, 48] to obtain traces; while such instrumentation can capture all memory references, it perturbs workload execution [16]. Other studies employed bus monitors [26], which have the drawback of capturing only memory activity reaching the bus. To overcome this, some have used a combination of instrumentation and bus monitors [78, 88, 79, 14]. As an example of more recent studies, Torrellas, Gupta, and Hennessy [78] measured L2 cache misses on an SMP of MIPS ....
EICKEMEYER, R. J., JOHNSON, R. E., KUNKEL, S. R., SQUILLANTE, M. S., AND LIU, S. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the International Symposium on Computer Architecture (May 1996).
....first to study the performance of memory consistency models in the context of database workloads. There are a number of studies based on the performance of out-of-order processors for non-database workloads (e.g., [42, 85, 89]). Most previous studies of databases are based on in-order processors [9, 27, 28, 34, 75, 94, 118, 119], and therefore do not address the benefits of more aggressive processor architectures. A number of the studies are limited to uniprocessor systems [27, 34, 72, 75]. As discussed in Section 4.4, data communication misses play a more dominant role in multiprocessor executions and somewhat ....
....non-database workloads (e.g., [42, 85, 89]). Most previous studies of databases are based on in-order processors [9, 27, 28, 34, 75, 94, 118, 119] and therefore do not address the benefits of more aggressive processor architectures. A number of the studies are limited to uniprocessor systems [27, 34, 72, 75]. As discussed in Section 4.4, data communication misses play a more dominant role in multiprocessor executions and somewhat reduce the relative effect of instruction stall times. Another important distinction among the database studies is whether they are based on monitoring existing systems [3, ....
Richard J. Eickemeyer, Ross E. Johnson, Steven R. Kunkel, Mark S. Squillante, and Shiafun Liu. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....multiprocessing (MP) [10]. From the software's perspective, hardware multithreading and multiprocessing are the same, and we treat them similarly in this paper. These techniques have been shown to improve performance substantially for important applications such as database workloads [4, 14, 27], web workloads [7, 28], and desktop applications [15]. This paper explores the correctness issues that arise from the interaction between these two techniques. To date, most value prediction research has assumed a single-threaded uniprocessor system and has ignored multithreading and ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....These techniques reduce data cache misses, and are orthogonal to the goal of CGP, which tries to reduce I-cache misses. CGP may be implemented on top of these cache-conscious algorithms. It is only recently that researchers have examined the performance impact of architectural features on DBMSs [1, 12, 25, 10, 19, 9, 11, 14]. Their results show that database applications have large instruction and data footprints and exhibit more unpredictable branch behavior than benchmarks that are commonly used in architectural studies (e.g., SPEC). Database applications have fewer loops and suffer from frequent context switches, ....
R. Eickemeyer, R. Johnson, S. Kunkel, M. Squillante, and S. Liu. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....organizations. Later studies were trace-based. Some researchers relied on intrusive instrumentation of the OS and user-level workloads [7, 25] to obtain traces; while such instrumentation can capture all memory references, it perturbs workload execution [7]. Other studies employed bus monitors [12], which have the drawback of capturing only memory activity reaching the bus. To overcome this, some have used a combination of instrumentation and bus monitors [5, 39, 46, 40]. As an example of more recent studies, Torrellas, Gupta, and Hennessy [39] measured L2 cache misses on an SMP of MIPS ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In 23rd Annual International Symposium on Computer Architecture, May 1996.
....a Sequent cache-coherent shared-memory multiprocessor and highlighted the importance of process scheduling and the I/O capability of the machine. Maynard et al. [7] contrasted the cache performance of technical and commercial workloads and concluded that the latter is often worse. Eickemeyer et al. [4] showed that a significant performance improvement can be obtained for OLTP workloads when a multithreaded processor is used. Finally, other studies that have involved database workloads include the work by Cvetanovic and Bhandarkar [1] on a DEC Alpha AXP system, Torrellas et al. [12] on an SGI ....
Richard J. Eickemeyer, Ross E. Johnson, Steven R. Kunkel, Mark S. Squillante, and Shiafun Liu. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212,
....Keywords: Database, transaction processing, decision support, microbenchmark, and performance evaluation. 1. INTRODUCTION In the last five to ten years, several studies have explored the architectural characteristics of online transaction processing (OLTP) database workloads [3] [7] [8] [9] [16] [17] [18] [19] [22] [23] [24] [26] [27] 1 This work was performed as part of the author's dissertation research. The author's present address is: Storage Systems Program, Hewlett-Packard Laboratories, 1501 Page Mill Road, M/S 1U-13, Palo Alto, CA 94304-1126. Her current email address is ....
....we believe that the most promising approach for producing a representative random microbenchmark lies in posing multiple read-only queries. 5. RELATED WORK Many of the studies that use database workloads to evaluate computer architecture innovations have employed the complex OLTP [3] [7] [8] [9] [16] [17] [18] [19] [22] [23] [24] [26] [27] [33] and DSS [3] [5] [15] [17] [18] [23] [31] workloads defined by the TPC. These studies vary in their usage of full-scale data sets versus in-memory data sets. One study provides rules of thumb for using an in-memory version of the TPC-B OLTP ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. "Evaluation of multithreaded uniprocessors for commercial application environments." In Proc. of the 23rd ISCA, May 1996, pp. 203--212.
....has already been referenced in earlier sections. We further discuss some of the previous work pertinent to database workloads and CMP in this section. There have been a large number of recent studies of database applications (both OLTP and DSS) due to the increasing importance of these workloads [4,7,8,12,21,27,28,34,35,36,42,46]. To the best of our knowledge, this is the first paper that provides a detailed evaluation of database workloads in the context of chip multiprocessing. Ranganathan et al. [35] study user-level traces of database workloads in the context of wide-issue out-of-order processors, and show that the ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....in particular, shows that SMT's latency tolerance makes SMT an extremely strong candidate architecture for future database servers. 6 Related work We are aware of only one other study that has examined the performance of commercial workloads on multithreaded architectures. Eickemeyer, et al. [5] used trace-driven simulation to evaluate the benefits of coarse-grain multithreading for TPC-C on the OS/400 database. By using two or three threads, throughput increased by 70%; but with more than 3 threads, no further gains were achieved. Because their coarse-grain architecture only switched ....
R. Eickemeyer, et al. Evaluation of multithreaded uniprocessors for commercial application environments. In 23rd Ann. Int'l Symp. on Computer Arch., p. 203--212, May 1996.
....trade-offs that arise in the integration of various system-level modules onto the processor chip, and quantifying the performance gains from such integration in the context of OLTP workloads. There have been a large number of recent studies of OLTP due to the increasing importance of this workload [1, 2, 3, 5, 8, 11, 12, 15, 16, 18]. Many of these studies emphasize the importance of memory system behavior on OLTP performance. Barroso et al. [1] provide performance results for various off-chip L2 cache sizes, and recommend the use of large (8MB) direct-mapped off-chip caches. This recommendation is consistent with our obser ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....a processor to tolerate cache miss latency by attempting to keep the execution units busy while a cache miss is serviced. A conceptually similar approach is taken by coarse-grained multithreaded processors, which can switch between independent threads of execution when a cache miss is detected [5]. 2.3 Use the available memory bandwidth more effectively Another way to reduce latency is to control the store traffic to the lower levels of the memory hierarchy. Reducing store traffic allows the lower levels of the memory hierarchy to concentrate on servicing demand misses, which must be ....
....runahead instructions. At this point the register sources r5, r6, and r7 are still unaccounted for. As these register values were not computed during the runahead episode, they must have been computed during normal operation. The simulator records this by incrementing prefetch regs[non runahead][5], prefetch regs[non runahead][6], and prefetch regs[non runahead][7]. Note that r31 is hard-wired to the value zero. We do not trace uses of r31 for this reason. Compiler register usage conventions are provided in Table 5.5. load r0, 0(r2) lda r3, 32(r4) sll r1, r4, r5 add r4, r1, r2 mult ....
Richard Eickemeyer, Ross Johnson, and Steven Kunkel, "Evaluation of Multithreaded Uniprocessors for Commercial Application Environments," In the Proceedings of the International Symposium on Computer Architecture, 1996.
....although I/O can be a major bottleneck, the processor is stalled 50% of the time due to cache misses when running OLTP workloads. In the past two years, several interesting studies evaluated database workloads, mostly on multiprocessor platforms. Most of these studies evaluate OLTP workloads [4][13][10], a few evaluate decision support (DSS) workloads [11], and there are some studies that use both [2][16]. All of the studies agree that the DBMS behavior depends upon the nature of the workload (DSS or OLTP), that DSS workloads benefit more from out-of-order processors with increased ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
....that, although I/O can be a major bottleneck, the processor is stalled 50% of the time due to cache misses when running OLTP workloads. In the past two years, several interesting studies evaluated database workloads, mostly on multiprocessor platforms. Most of these studies evaluate OLTP workloads [4][13][10], a few evaluate DSS workloads [11], and there are some studies that use both [2][16]. All of the studies agree that the DBMS behavior depends upon the workload, that DSS workloads benefit more than OLTP from out-of-order processors with increased instruction-level parallelism, and that the ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.
.... along with academic research in this area, have been heavily influenced by popular scientific and engineering benchmarks such as SPLASH-2 [26] and STREAMS [13], with only a handful of published architectural studies that have in some way tried to address issues specific to commercial workloads [3, 7, 9, 12, 14, 16, 20, 21, 24]. The lack of architectural research on commercial applications is partly due to the fact that I/O issues have been historically considered as the primary performance bottleneck for such workloads. However, innovations in disk subsystems (RAID arrays, use of non-volatile memory) and software ....
....that are based on the behavior of TPC-B and TPC-D Q6. Although they agree with our observation that TPC-B memory behavior is representative of TPC-C, the limitations of their methodology do not allow them to study memory system performance in the level of detail done here. Eickemeyer et al. [7] take a uniprocessor IBM AS/400 trace of TPC-C and transform it to drive both a simulator and analytic models of coarse-grain multithreaded uniprocessors. They conclude that multithreading is effective in hiding latency of OLTP. Verghese et al. [24] use a DSS workload to evaluate operating system ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....the first to study the performance of memory consistency models in the context of database workloads. There are a number of studies based on the performance of out-of-order processors for non-database workloads (e.g., [8, 16, 18]). Most previous studies of databases are based on in-order processors [2, 4, 5, 6, 14, 20, 27, 28], and therefore do not address the benefits of more aggressive processor architectures. A number of the studies are limited to uniprocessor systems [4, 6, 13, 14]. As discussed in Section 3, data communication misses play a more 3 The increase in the local and remote components of read latency ....
....for non-database workloads (e.g., [8, 16, 18]). Most previous studies of databases are based on in-order processors [2, 4, 5, 6, 14, 20, 27, 28] and therefore do not address the benefits of more aggressive processor architectures. A number of the studies are limited to uniprocessor systems [4, 6, 13, 14]. As discussed in Section 3, data communication misses play a more 3 The increase in the local and remote components of read latency corresponds to the dirty misses that are converted to misses serviced by the memory. 4 Given the amount of tuning that is done on database benchmarks, it would ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....a Sequent cache-coherent shared-memory multiprocessor and highlighted the importance of process scheduling and the I/O capability of the machine. Maynard et al. [6] contrasted the cache performance of technical and commercial workloads and concluded that the latter is often worse. Eickemeyer et al. [4] showed that a significant performance improvement can be obtained for OLTP workloads when a multithreaded processor is used. Finally, other studies that have involved database workloads include the work by Cvetanovic and Bhandarkar [1] on a DEC Alpha AXP system, Torrellas et al. [10] on an SGI ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
....increases. However, the disadvantage is a significant increase in response times at the switches (collectively S_obs) and at the local memory module (L_obs) 4 4 Agarwal [3] reports a deteriorating effect of partitioning of a cache at a large n_t. Thekkath et al. [28] and Eickemeyer et al. [11] report little variations in cache miss rates ( 1 R ) due to multithreading. In this paper, we do not explore this application-dependent phenomenon. In summary, we note the following points for the network latency tolerance: • Workload characteristics, and not the resulting S_obs value, ....
R.J. Eickemeyer, R.E. Johnson, S.R. Kunkel, M.S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture. ACM, 1996.
....combinations of several high-performance techniques: superscalar out-of-order execution [2, 30, 37, 31, 6], decoupling [28, 27, 21], VLIW execution [12, 5, 36, 10], and multithreading [1, 32, 34, 19, 11]. [This work was supported by the Ministry of Education of Spain under contract TIC 0429 95 and by the CEPBA.] The current generation of microprocessors all use superscalar execution coupled with a complex memory hierarchy based on several cache levels to attempt executing several instructions per cycle. VLIW processors have long been researched but have not reached the mass market due to their software ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In ISCA, pages 203--212. ACM Press, May 1996.
....tasks asynchronously and that communicate through architectural queues. Latency is hidden by the fact that usually the address processor is able to slip ahead of the computation processor and start loading data that will be needed soon by the computation processor. Multithreaded scalar processors [1, 24, 25, 13, 5] attack the memory latency problem by switching between threads of computations every time a long-latency operation (such as a cache miss) threatens to halt the processor. This approach not only fights memory latency, but also produces a system with higher throughput and better resource ....
....latency and section 8 looks at the effects of the register file crossbars. Section 9 studies the effect of duplicating the control unit and finally section 10 presents our conclusions and future work. 2 Related work Multithreading for scalar programs has received much attention in recent years [24, 25, 13, 5] and has been found to be generally useful. In this paper we diverge from previous work in three key aspects. First, we will study multithreading in the context of vector architectures and highly vectorizable programs. There has been some research on interleaving of vector instructions at the ....
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In ISCA, pages 203--212. ACM Press, May 1996.
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In 23rd Annual International Symposium on Computer Architecture, pages 203--212, May 1996.
Richard J. Eickemeyer, Ross E. Johnson, Steven R. Kunkel, Mark S. Squillante, and Shiafun Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the Twenty-Third International Symposium on Computer Architecture, pages 203--212, 1996.
R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, and S. Liu. "Evaluation of multithreaded uniprocessors for commercial application environments," Proc. of the 23rd ISCA, pages 203--212, May 1996.
R. Eickemeyer, R. Johnson, S. Kunkel, M. Squillante, and S. Liu. Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 203--212, May 1996.
Richard J. Eickemeyer, Ross E. Johnson, Steve R. Kunkel, Mark S. Squillante, and Shiafun Liu. Evaluation of multithreaded uniprocessors for commercial application environments. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 203--212. ACM Press, May 1996.
EIC96: R. J. Eickemeyer, R. E. Johnson, S. R. Kunkel, M. S. Squillante, S. Liu, "Evaluation of Multithreaded Uniprocessors for Commercial Applications", SIGARCH Comp Arch News, May 1996.