Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, and Dean M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In International Symposium on Microarchitecture, pages 114--124, 1997.
No context found.
J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In 30th International Symposium on Microarchitecture, Dec. 1997.
No context found.
J. L. Lo, S. J. Eggers, H. M. Levy, S. S. Parekh, and D. M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In 30th International Symposium on Microarchitecture, pages 114--124, December 1997.
.... executes multiple threads of control, and instructions from these threads execute in parallel; however, SMT's thread-shared data structures are more effectively utilized if they are organized so as to encourage sharing (even false sharing), as is often the default organization on uniprocessors [31]. The crucial task for researchers is to determine which software data structures and policies, multiprocessor, uniprocessor, or neither, are most appropriate for SMT. To address this challenge, this paper investigates the performance impact and optimization of three software issues that SMT ....
.... SMT as effective in increasing instruction throughput (i.e., two- to four-fold speedups) on a variety of workloads (including scientific, database, and web servers, in both multiprogrammed and parallel environments) while still providing good performance for single-threaded applications [42, 30, 31, 29]. For our purposes, the most important feature of SMT's architecture is that all contexts dynamically share most processor resources, including the functional units, caches, TLBs, and fetch bandwidth. The sharing of caches, in particular, makes inter-thread data communication and synchronization ....
[Article contains additional citation context not shown here]
J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In 30th Annual International Symposium on Microarchitecture, December 1997.
....instruction-level parallelism. Previous research has established that SMT is effective in increasing throughput on a variety of workloads, while still providing good performance for single-threaded applications [81, 44, 45, 43, 87]. As a general-purpose throughput-enhancing mechanism, simultaneous multithreading is especially well suited to applications that are inherently multithreaded, such as database and Web servers, as well as multiprogrammed and parallel scientific workloads. At the hardware level, SMT is a ....
....modifications to an out-of-order superscalar necessary to support a four-context SMT translated into only a 6% increase in chip area [27] 2.1.1. 2 SMT simulator core The SMT application-level simulator is a detailed, stand-alone, execution-based simulator used extensively in previous SMT studies [22, 43, 44, 45, 46, 47, 59, 77, 81, 82, 83]. It models the processor pipeline and memory system in great detail. While the simulator excels at modelling user-level code, it lacks the facilities necessary to accurately model an operating system. On a real machine, implicit or explicit user requests for OS service begin with a trap, ....
[Article contains additional citation context not shown here]
LO, J., EGGERS, S., LEVY, H., PAREKH, S., AND TULLSEN, D. Tuning compiler optimizations for simultaneous multithreading. In Proceedings of the International Symposium on Microarchitecture (December 1997).
....of the inner loop, 4 and 2x2 and 4x4 tiles. For each of these variants, we implement three cache tiling variants: no tiling, tiled with the outer loop block distributed to threads, and tiled with an inner loop cyclically distributed to threads. 5 Furthermore, we varied the size of 3 Previous work [8] used 256 x 128 x 64 matrices. They reported that tile size did not affect performance. For their processor configuration, this problem was too small to make cache tiling worthwhile. 4 gcc did not hoist, even on the highest optimization level. Previously, [8] compared block and cyclic ....
....we varied the size of 3 Previous work [8] used 256 x 128 x 64 matrices. They reported that tile size did not affect performance. For their processor configuration, this problem was too small to make cache tiling worthwhile. 4 gcc did not hoist, even on the highest optimization level. Previously, [8] compared block and cyclic distribution on a variety of problems. They did not tile for registers. [Plot: panel "no register tile"; legend: none, block, cyclic; x-axis: Number of threads] Figure 2: The best tuned matrix multiply, at 1.36 cycles per inner loop ....
[Article contains additional citation context not shown here]
Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, and Dean M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In International Symposium on Microarchitecture, December 1997.
....noteworthy behavior. Loops with independent iterations are basically vector computations that can be carried out independently for every element. They are easy to parallelize using iteration interleaving, which consists of assigning iterations to threads not in blocks, but interleaved. As shown in [5], in these cases this is the most efficient way to express parallelism: better cache and TLB utilization is reached with this solution. These kinds of loops are easily run on the SMT processor, with good speedup. Figure 1 shows statistics [Plot: x-axis: Number of threads] ....
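The excerpt above contrasts assigning loop iterations to threads in blocks with assigning them interleaved (cyclically). The following is only an illustrative sketch of the two index mappings, not code from the cited papers; the function names block_partition and interleaved_partition are hypothetical.

```python
def block_partition(n_iters, n_threads):
    """Block distribution: each thread gets one contiguous chunk of iterations."""
    chunk = (n_iters + n_threads - 1) // n_threads  # ceiling division
    return [list(range(t * chunk, min((t + 1) * chunk, n_iters)))
            for t in range(n_threads)]


def interleaved_partition(n_iters, n_threads):
    """Interleaved (cyclic) distribution: thread t gets iterations t, t + T, t + 2T, ..."""
    return [list(range(t, n_iters, n_threads)) for t in range(n_threads)]


if __name__ == "__main__":
    # 8 independent iterations shared among 4 threads (illustrative sizes only)
    print("block:      ", block_partition(8, 4))
    print("interleaved:", interleaved_partition(8, 4))
```

With interleaving, threads that advance at a similar pace touch neighbouring elements at about the same time, which is one plausible reading of the excerpt's claim of better cache and TLB utilization when the threads share those structures.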
J. L. Lo, S. J. Eggers, H. M. Levy, S. S. Parekh, and D. M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. 30th Annual International Symposium on Microarchitecture (Micro-30), December 1997.
....each cycle. SMT works by converting thread-level parallelism into instruction-level parallelism, effectively feeding instructions from different threads into the functional units of a wide-issue, out-of-order superscalar processor [42, 41]. Over the last six years, SMT has been broadly studied [22, 23, 21, 45, 24, 43, 35], and Compaq has recently announced that the Alpha 21464 will include SMT [10]. As a general-purpose throughput-enhancing mechanism, simultaneous multithreading is especially well suited to applications that are inherently multithreaded, such as database and Web servers, as well as multiprogrammed ....
....better utilization of execution resources by converting thread-level parallelism into instruction-level parallelism. Previous research has established SMT as effective in increasing throughput on a variety of workloads, while still providing good performance for single-threaded applications [41, 22, 23, 21, 45]. At the hardware level, SMT is a straightforward extension of modern, out-of-order superscalars, such as the MIPS R10000 [15] or the Alpha 21264 [16]. SMT duplicates the register file, program counter, subroutine stack and internal processor registers of a superscalar to hold the state of ....
[Article contains additional citation context not shown here]
J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In 30th Annual International Symposium on Microarchitecture, December 1997.
....than SMT for OLTP and DSS, respectively. By interleaving instructions from multiple threads, and by choosing to fetch from threads that are making the most effective utilization of the execution resources [23], SMT reduces the need for (and more importantly, the cost of) speculative execution [10]. SMT also greatly reduces the number of cycles in which no instructions can be fetched due to misfetches or I-cache misses. On the DSS workload, SMT nearly eliminates all zero-fetch cycles. On OLTP, fetch stalls are reduced by 78%; zero-fetch cycles are still 15.5%, because OLTP instruction cache ....
J. Lo, et al. Tuning compiler optimizations for simultaneous multithreading. In 30th Int'l Symp. on Microarchitecture, p. 114--124, December 1997.
....inner loop, 4 and 2x2 and 4x4 tiles. For each of these variants, we implement three cache tiling variants: no tiling, tiled with the outer loop block distributed to threads, and tiled with an inner loop cyclically distributed to threads. 5 Furthermore, we varied the size of 3 Previous work [8] used 256 x 128 x 64 matrices. They reported that tile size did not affect performance. For their processor configuration, this problem was too small to make cache tiling worthwhile. 4 gcc did not hoist, even on the highest optimization level. 5 Previously, [8] compared block and ....
....of 3 Previous work [8] used 256 x 128 x 64 matrices. They reported that tile size did not affect performance. For their processor configuration, this problem was too small to make cache tiling worthwhile. 4 gcc did not hoist, even on the highest optimization level. 5 Previously, [8] compared block and cyclic distribution on a variety of problems. They did not tile for registers. Appeared in the Proceedings of Supercomputing 99. [Plot: x-axis: Number of threads; y-axis: Cycles per element; legend: none, block, cyclic; panel titles: no load store hoisting, cache tile] ....
[Article contains additional citation context not shown here]
Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, and Dean M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In International Symposium on Microarchitecture, December 1997.
....from multiple threads in a single cycle [19, 18]. Multithreaded processors, such as SMT, provide an opportunity to greatly decrease synchronization cost, because the communicating threads are internal to a single processor. While previous work has shown the benefits of SMT on parallel workloads [6, 13], those studies relied on traditional synchronization mechanisms, ignoring the potential advantages (and problems) of synchronizing in an SMT CPU. A simultaneous multithreaded processor differs from a conventional multiprocessor in several crucial ways that influence the design of SMT ....
....other multithreaded architectures and multithreaded synchronization mechanisms. Other work which has impacted or is related to this study follows. There have been several papers that describe the design of simultaneous multithreading processors [18] and analyze their performance on parallel [12, 6, 13] and multiprogrammed [18, 6] workloads. The first group [12, 6, 13], which examined simultaneous multithreading with a more traditional multiprocessor workload, and in light of more traditional parallel compiler transformations, is the most relevant to this work. These studies showed that an SMT ....
[Article contains additional citation context not shown here]
J.L. Lo, S.J. Eggers, H.M. Levy, S.S. Parekh, and D.M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In 30th Annual International Symposium on Microarchitecture, December 1997.
....instructions from multiple threads in a single cycle [10, 9]. Multithreaded processors provide an opportunity to greatly decrease synchronization cost, because the communicating threads are internal to a single processor. While previous work has shown the benefits of SMT on parallel workloads [6, 7], those studies relied on traditional synchronization mechanisms, ignoring the potential advantages (and problems) of synchronizing in an SMT CPU. A simultaneous multithreading processor differs from a conventional multiprocessor in several crucial ways that influence the design of SMT ....
J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In International Symposium on Microarchitecture, December 1997.
....from multiple threads in a single cycle [20, 19]. Multithreaded processors, such as SMT, provide an opportunity to greatly decrease synchronization cost, because the communicating threads are internal to a single processor. While previous work has shown the benefits of SMT on parallel workloads [6, 14], those studies relied on traditional synchronization mechanisms, ignoring the potential advantages (and problems) of synchronizing in an SMT CPU. A simultaneous multithreading processor differs from a conventional multiprocessor in several crucial ways that influence the design of SMT ....
....Section 2.1 described other multithreaded architectures and multithreaded synchronization mechanisms. Other work which has impacted or is related to this study follows. Previous papers describe the design of simultaneous multithreading processors [19] and analyze their performance on parallel [13, 6, 14], multiprogrammed [19, 6] and database [12] workloads. The first group [13, 6, 14], which examined simultaneous multithreading with a more traditional multiprocessor workload, and in light of more traditional parallel compiler transformations, is the most relevant to this work. These studies ....
[Article contains additional citation context not shown here]
J.L. Lo, S.J. Eggers, H.M. Levy, S.S. Parekh, and D.M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In 30th Annual International Symposium on Microarchitecture, December 1997.
No context found.
Jack L. Lo, Susan J. Eggers, Henry M. Levy, Sujay S. Parekh, and Dean M. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In International Symposium on Microarchitecture, pages 114--124, 1997.
No context found.
J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen. Tuning compiler optimizations for simultaneous multithreading. In Proceedings of 30th MICRO, pages 114--124, December 1997.