20 citations found.
J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(3):322--354, Aug. 1997.


This paper is cited in the following contexts:
Execution Performance of the Scheduled Dataflow Architecture - Kavi

....256 or larger, the available thread level parallelism in SDF (and the overlapped execution of SP and EP) exceeds the available instruction level parallelism, leading to a better performance by SDF. This data is in line with the studies performed on Simultaneous Multithreading ([Mitchell 99], [Lo 97]), which indicate that high performance is achieved by using a combination of thread level and instruction level parallelism. Figure 5 shows this more clearly: for larger data sizes, SDF performs better than all other architectures. The figure plots execution time in cycles for different data ....

J.L. Lo, et al. "Converting thread-level parallelism into instruction-level parallelism via Simultaneous Multithreading", ACM Transactions on Computer Systems, Aug. 1997, pp 322-354.


Time-Shifted Modules: Exploiting Code Modularity for Fine.. - Zilles, Sohi

....data driven threads (which need not be contiguous) are more suited for smaller, high impact code fragments like slices of cache missing loads and mispredicting branches. Previous studies exploring the performance of multithreaded code with SMT include: databases [22], web serving [26], SPLASH [23], and fine grain scientific programs [31]. To our knowledge this paper includes the first analysis of fine grain integer programs on an SMT. 7 Conclusion. In this paper, we have explored a technique to exploit the thread level concurrency that will be available in next generation processors. This ....

J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting Thread-level Parallelism into Instruction-level Parallelism via Simultaneous Multithreading. ACM Transactions on Computers, 15(2), Aug. 1997.


The Need for Fast Communication in Hardware-Based.. - Krishnan, Torrellas (1999)   (6 citations)

....be based on single issue processors, thereby allowing more processors to be configured on chip. In that case, the inter processor communication latency is not as crucial. However, exploiting both thread and instruction level parallelism is critical for the performance of multithreaded applications [15]. Thus CMPs are likely to be based on wide issue dynamic superscalar processors. Note that a fast interconnection may be used to communicate values without supporting register level communication. An example is the Superthreaded architecture [24]. Here, an on chip memory buffer holds the dependent ....

J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, 15(3):322-354, August 1997.


CROPS: Coordinated Restructuring Of Programs and Storage - Carter, Ferrante.. (1999)

....and translation lookaside buffer entries at the finest resolution possible. By sharing resources at a fine granularity, SMT ideally renders instruction level parallelism (ILP) and TLP operationally equivalent; they introduce equally many independent instructions into the processor's pipelines [11]. Does operational equivalence imply performance equivalence? That is, given an incoming bandwidth of independent instructions, will SMT perform equally well, whether the independence comes from ILP or from TLP? Problem: We argue that this is not always the case. Rather, implementing parallelism ....

Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


Speculative Multithreading Architectures - Krishnan (1998)

....69, 76] it is very fast to perform. It could be argued that CMPs could be based on single issue processors thereby allowing more processors to be configured on chip. However, exploiting both thread level and instruction level parallelism is critical for the performance of multithreaded applications [44]. Thus we assume that CMPs would be based on wide issue dynamic superscalar processors, rather than static scalar processors. In this context, we argue that the wide issue superscalar processors that will soon populate CMPs would make register communication a requirement for high performance. In ....

J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, and D.M. Tullsen. Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, 15(3):322-354, August 1997.


A Clustered Approach to Multithreaded Processors - Krishnan, Torrellas (1998)   (6 citations)

....using a resource, that resource can, typically, be utilized by another thread. Tullsen et al. [16] describe a fully centralized SMT architecture with a relatively small impact on a conventional superscalar design. Evaluation of this architecture when running multiprogrammed and parallel workloads [6, 9, 16] has shown significant speedups. However, a drawback of this approach is that it may inherit all the complexity of existing superscalars and, in addition, add extra hardware. In fact, with the delays in the register bypass network dictating the cycle time of future high issue processors [12] a ....

J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting Thread-Level Parallelism Into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


Executing Sequential Binaries on a Clustered Multithreaded.. - Krishnan, Torrellas (1998)   (4 citations)

....support multiple threads such that, in a given cycle, instructions from different threads can be issued. If, in a given cycle, a thread is not using a resource, that resource can typically be utilized by another thread. Evaluation of this architecture running multiprogrammed and parallel workloads [3, 4, 11] [Footnote 1: This work was supported in part by the National Science Foundation under grants NSF Young Investigator Award MIP 9457436, ASC 9612099 and MIP 9619351, DARPA Contract DABT63 95 C0097, NASA Contract NAG 1 613, and gifts from IBM and Intel.] has shown significant speedups. However, a drawback of ....

....Figure 4: SMT architectures: (a) centralized and (b) clustered.
Type of Processor          Number of    Threads per        Max. IPC per       Number of FUs per Processor [Chip]
                           Processors   Processor [Chip]   Processor [Chip]   (int, ld/st, fp)
Conventional Superscalar   1            1 [1]              8 [8]              8, 4, 4 [8, 4, 4]
Centralized SMT            1            4 [4]              8 [8]              8, 4, 4 [8, 4, 4]
Clustered SMT              2            2 [4]              4 [8]              4, 2, 2 [8, 4, 4]
FA 2                       2            1 [2]              4 [8]              4, 2, 2 [8, 4, 4]
FA 4                       4            1 [4]              2 [8]              2, 1, 1 [8, 4, 4]
Table 4: Description of the different types of architectures evaluated.
3.3 Simulation Approach. Our simulation environment is built upon the MINT [12] ....

[Article contains additional citation context not shown here]

J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


Proceedings of 12th Intl Conference on Parallel.. - Initial Observations Of   Self-citation (Tullsen)

No context found.

J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(3):322--354, Aug. 1997.


Mini-threads: Increasing TLP on Small-Scale SMT Processors - Joshua Redstone Susan (2003)   (3 citations)  Self-citation (Eggers Levy)

....can improve performance significantly, particularly on small scale, space sensitive CPU designs. 1. Introduction. Simultaneous Multithreading (SMT) is a latency tolerant CPU architecture that adds multiple hardware contexts to an out of order superscalar to dramatically improve machine throughput [14, 32, 25, 13]. Recently, several manufacturers have announced small scale SMTs (e.g. 2 to 4 thread contexts) both as single CPUs and as components of multiple CPUs on a chip [12, 30]. While these small scale SMTs increase performance, they still leave modern wide issue CPUs with underutilized resources, ....

LO, J., EGGERS, S., EMER, J., LEVY, H., STAMM, R., AND TULLSEN, D. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems 15, 2 (August 1997).


Improving Server Software Support for Simultaneous.. - McDowell, Eggers..   Self-citation (Eggers)

.... SMT as effective in increasing instruction throughput (i.e. two to four fold speedups) on a variety of workloads (including scientific, database, and web servers, in both multiprogrammed and parallel environments) while still providing good performance for single threaded applications [42, 30, 31, 29]. For our purposes, the most important feature of SMT's architecture is that all contexts dynamically share most processor resources, including the functional units, caches, TLBs, and fetch bandwidth. The sharing of caches, in particular, makes inter thread data communication and synchronization ....

....application. Programmers may still provide their own mechanisms where appropriate, but providing reliable primitives for the common case makes good design sense. 7 RELATED WORK Prior research on multithreaded processors has considered mainly multiprogrammed [43, 15, 40] or scientific workloads [15, 30, 25] such as the SPLASH 2 benchmarks [47] rather than server applications. Evaluation of server applications, however, is particularly important, because several studies [5, 13, 39] have found that the performance of servers is often far worse than that of typical SPEC integer and floating point ....

J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(2), August 1997.


An Analysis of Software Interface Issues for SMT Processors - Redstone (2002)   (1 citation)  Self-citation (Eggers Levy)

....instruction level parallelism. Previous research has established that SMT is effective in increasing throughput on a variety of workloads, while still providing good performance for single threaded applications [81, 44, 45, 43, 87]. As a general purpose throughput enhancing mechanism, simultaneous multithreading is especially well suited to applications that are inherently multithreaded, such as database and Web servers, as well as multiprogrammed and parallel scientific workloads. At the hardware level, SMT is a ....

....modifications to an out of order superscalar necessary to support a four context SMT translated into only a 6% increase in chip area [27]. 2.1.1.2 SMT simulator core. The SMT application level simulator is a detailed, stand alone, execution based simulator used extensively in previous SMT studies [22, 43, 44, 45, 46, 47, 59, 77, 81, 82, 83]. It models the processor pipeline and memory system in great detail. While the simulator excels at modelling user level code, it lacks the facilities necessary to accurately model an operating system. On a real machine, implicit or explicit user requests for OS service begin with a trap, ....

[Article contains additional citation context not shown here]

LO, J., EGGERS, S., EMER, J., LEVY, H., STAMM, R., AND TULLSEN, D. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems 15, 2 (August 1997).


ILP versus TLP on SMT - Mitchell, Carter, Ferrante, Tullsen (1999)   (3 citations)  Self-citation (Tullsen)

....pool, execution units, cache, and translation lookaside buffer (TLB) at the finest resolution possible. By sharing resources at a fine granularity, SMT ideally renders ILP and TLP operationally equivalent; both introduce equally many independent instructions into the processor's pipelines [7]. Does operational equivalence imply performance equivalence? That is: Given an incoming bandwidth of independent instructions, will SMT perform equally well, whether the independence comes from ILP or from TLP? In this paper, we argue that this is not always the case. Rather, given ILP and ....

....processor switched contexts every 4000 cycles, more coarsely than SMT. Tullsen et al. studied the performance of SMT on multiprogram workloads [18]. They found that SMT sped up benchmarks upwards of five times. They did not look at the performance of SMT on single program workloads. Lo et al. [7] compared single program performance of SMT and multiprocessor configurations. They found SMT can be as much as 2.68 times faster than a multiprocessor. However, they did not factor in the effect of performance tuning on speedups. Compiler issues for multithreading: Lo et al. previously studied ....

Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


An Analysis of Operating System Behavior on a.. - Redstone, Eggers, Levy (2000)   (9 citations)  Self-citation (Eggers Levy)

....each cycle. SMT works by converting thread level parallelism into instruction level parallelism, effectively feeding instructions from different threads into the functional units of a wide issue, out of order superscalar processor [42, 41]. Over the last six years, SMT has been broadly studied [22, 23, 21, 45, 24, 43, 35], and Compaq has recently announced that the Alpha 21464 will include SMT [10]. As a general purpose throughput enhancing mechanism, simultaneous multithreading is especially well suited to applications that are inherently multithreaded, such as database and Web servers, as well as multiprogrammed ....

....better utilization of execution resources by converting thread level parallelism into instruction level parallelism. Previous research has established SMT as effective in increasing throughput on a variety of workloads, while still providing good performance for single threaded applications [41, 22, 23, 21, 45]. At the hardware level, SMT is a straightforward extension of modern, out of order superscalars, such as the MIPS R10000 [15] or the Alpha 21264 [16]. SMT duplicates the register file, program counter, subroutine stack and internal processor registers of a superscalar to hold the state of ....

[Article contains additional citation context not shown here]

J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(2), August 1997.


Power-Sensitive Multithreaded Architecture - Seng, Tullsen, Cai (2000)   (10 citations)  Self-citation (Tullsen)

....That paper examines the power effects of pipelining and superscalar issue, but does not consider the effects of multithreading. Simultaneous multithreading has been shown to be an effective architecture to increase processor throughput both in multiprogrammed [15, 14] and parallel execution [10]. Previous work has demonstrated SMT's reduced dependence on speculation to achieve parallelism [14, 8]. 3. Modelling Power. This section describes the power model used to produce the energy and power results in the rest of the paper. This power model is integrated into a detailed cycle by cycle ....

J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, S.S. Parekh, R.L. Stamm, and D.M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, August 1997.


ILP versus TLP on SMT - Mitchell, Carter, Ferrante, Tullsen (1999)   (3 citations)  Self-citation (Tullsen)

....register pool, execution units, cache, and translation lookaside buffer (TLB) at the finest resolution possible. By sharing resources at a fine granularity, SMT ideally renders ILP and TLP operationally equivalent; both introduce equally many independent instructions into the processor's pipelines [7]. Does operational equivalence imply performance equivalence? That is: Given an incoming bandwidth of independent instructions, will SMT perform equally well, whether the independence comes from ILP or from TLP? In this paper, we argue that this is not always the case. Rather, given ILP and TLP ....

....processor switched contexts every 4000 cycles, more coarsely than SMT. Tullsen et al. studied the performance of SMT on multiprogram workloads [18]. They found that SMT sped up benchmarks upwards of five times. They did not look at the performance of SMT on single program workloads. Lo et al. [7] compared single program performance of SMT and multiprocessor configurations. They found SMT can be as much as 2.68 times faster than a multiprocessor. However, they did not factor in the effect of performance tuning on speedups. Compiler issues for multithreading: Lo et al. previously studied ....

Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


Supporting Fine-Grained Synchronization on a.. - Tullsen, Lo, Eggers.. (1999)   (14 citations)  Self-citation (Lo Eggers Levy Tullsen)

....other multithreaded architectures and multithreaded synchronization mechanisms. Other work which has impacted or is related to this study follows. There have been several papers that describe the design of simultaneous multithreading processors [18] and analyze their performance on parallel [12, 6, 13] and multiprogrammed [18, 6] workloads. The first group [12, 6, 13], which examined simultaneous multithreading with a more traditional multiprocessor workload, and in light of more traditional parallel compiler transformations, is the most relevant to this work. These studies showed that an SMT ....

....mechanisms. Other work which has impacted or is related to this study follows. There have been several papers that describe the design of simultaneous multithreading processors [18] and analyze their performance on parallel [12, 6, 13] and multiprogrammed [18, 6] workloads. The first group [12, 6, 13], which examined simultaneous multithreading with a more traditional multiprocessor workload, and in light of more traditional parallel compiler transformations is the most relevant to this work. These studies showed that an SMT processor achieves significant speedups on code that is ....

J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, S.S. Parekh, R.L. Stamm, and D.M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


Supporting Fine-Grained Synchronization on a Simultaneous.. - Dean Tullsen (1999)   (14 citations)  Self-citation (Lo Eggers Levy Tullsen)

....instructions from multiple threads in a single cycle [10, 9]. Multithreaded processors provide an opportunity to greatly decrease synchronization cost, because the communicating threads are internal to a single processor. While previous work has shown the benefits of SMT on parallel workloads [6, 7], those studies relied on traditional synchronization mechanisms, ignoring the potential advantages (and problems) of synchronizing in an SMT CPU. A simultaneous multithreading processor differs from a conventional multiprocessor in several crucial ways that influence the design of SMT ....

J. Lo, S. Eggers, J. Emer, H. Levy, S. Parekh, R. Stamm, and D. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, August 1997.


Supporting Fine-Grained Synchronization on a.. - Tullsen, Lo, Eggers.. (1998)   (14 citations)  Self-citation (Lo Eggers Levy Tullsen)

....Section 2.1 described other multithreaded architectures and multithreaded synchronization mechanisms. Other work which has impacted or is related to this study follows. Previous papers describe the design of simultaneous multithreading processors [19] and analyze their performance on parallel [13, 6, 14], multiprogrammed [19, 6] and database [12] workloads. The first group [13, 6, 14], which examined simultaneous multithreading with a more traditional multiprocessor workload, and in light of more traditional parallel compiler transformations, is the most relevant to this work. These studies ....

....mechanisms. Other work which has impacted or is related to this study follows. Previous papers describe the design of simultaneous multithreading processors [19] and analyze their performance on parallel [13, 6, 14], multiprogrammed [19, 6] and database [12] workloads. The first group [13, 6, 14], which examined simultaneous multithreading with a more traditional multiprocessor workload, and in light of more traditional parallel compiler transformations, is the most relevant to this work. These studies showed that an SMT processor achieves significant speedups on code that is parallelized ....

J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, S.S. Parekh, R.L. Stamm, and D.M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


Hardware and Software Mechanisms for Multithreading in.. - Bradford (2001)

No context found.

Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen. Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, pages 322--354, August 1997.


Scalability of Scheduled Dataflow Architecture (SDF) with.. - Arul, Kavi

No context found.

Lo, J. L., Eggers, S. J., Emer, J. S., Levy, H. M., Stamm, R. L., and Tullsen, D. M., "Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading," ACM Trans. on Computer Systems, Aug. 1997, pp. 322-354.


CiteSeer.IST - Copyright Penn State and NEC