The Multiscalar Architecture (1993)

by M Franklin

Results 1 - 10 of 125

Simultaneous Multithreading: Maximizing On-Chip Parallelism

by Dean M. Tullsen, Susan J. Eggers, Henry M. Levy, 1995
Abstract - Cited by 823 (48 self)
This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar’s multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.

Citation Context

...asis similar to the Tera scheme. There is no simultaneous issue of instructions from multiple threads to functional units in the same cycle on individual clusters. Franklin's Multiscalar architecture [13, 12] assigns fine-grain threads to processors, so competition for execution resources (processors in this case) is at the level of a task rather than an individual instruction. Hirata, et al., [16] presen...

Complexity-effective superscalar processors

by Subbarao Palacharla, J. E. Smith, et al. - In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997
Abstract - Cited by 467 (5 self)
The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future. A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster – consequently overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
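
The idea of placing chains of dependent instructions into queues can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `steer`, the `(name, dest, srcs)` tuple layout, and the fallback used when no queue is empty are assumptions made for illustration, and the queues never drain because instruction issue is not modeled.

```python
def steer(instructions, num_fifos):
    """Assign each instruction to a FIFO so that dependent chains share a queue.

    instructions: list of (name, dest_register, source_registers) tuples
                  (a hypothetical layout chosen for this sketch).
    Returns a list of FIFOs, each a list of instruction names in program order.
    """
    fifos = [[] for _ in range(num_fifos)]
    # Destination register of the instruction currently at each FIFO's tail.
    tail_dest = [None] * num_fifos

    for name, dest, srcs in instructions:
        # 1. Prefer a FIFO whose tail instruction produces one of our operands.
        target = next(
            (i for i, d in enumerate(tail_dest) if d is not None and d in srcs),
            None,
        )
        # 2. Otherwise start a new chain in the first empty FIFO.
        if target is None:
            target = next((i for i, f in enumerate(fifos) if not f), None)
        # 3. Assumption: with no empty FIFO, fall back to the shortest one.
        #    A real pipeline would more plausibly stall dispatch here.
        if target is None:
            target = min(range(num_fifos), key=lambda i: len(fifos[i]))

        fifos[target].append(name)
        tail_dest[target] = dest

    return fifos


program = [
    ("i1", "r1", []),
    ("i2", "r2", ["r1"]),   # depends on i1
    ("i3", "r3", []),
    ("i4", "r4", ["r3"]),   # depends on i3
    ("i5", "r5", ["r2"]),   # depends on i2
]
queues = steer(program, 2)
```

With two queues, the two independent chains in `program` end up in separate FIFOs, so only the instruction at each queue head would need to be considered for issue.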

Exceeding the Dataflow Limit via Value Prediction

by Mikko H. Lipasti, John Paul Shen, 1996
Abstract - Cited by 291 (16 self)
Abstract not found

The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization

by J. Gregory Steffan, Todd C. Mowry - HPCA-4, 1998
Abstract - Cited by 256 (9 self)
As we look to the future, and the prospect of a billion transistors on a chip, it seems inevitable that microprocessors will exploit having multiple parallel threads. To achieve the full potential of these "single-chip multiprocessors," however, we must find a way to parallelize non-numeric applications. Unfortunately, compilers have had little success in parallelizing non-numeric codes due to their complex access patterns. This paper explores the potential for using thread-level data speculation (TLDS) to overcome this limitation by allowing the compiler to view parallelization solely as a cost/benefit tradeoff, rather than something which is likely to violate program correctness. Our experimental results demonstrate that with realistic compiler support, TLDS can offer significant program speedups. We also demonstrate that through modest hardware extensions, a generic single-chip multiprocessor could support TLDS by augmenting its cache coherence scheme to detect dependence violations, and by using the primary data caches to buffer speculative state.

Citation Context

...speculation has received much attention [5, 9, 20], the only relevant work on thread-level data speculation for non-numeric codes when we performed our study was the Wisconsin Multiscalar architecture [3, 4, 21]. This tightly-coupled ring architecture assigns threads around the ring in program order, provides a hardware mechanism for forwarding register values between processors, and uses a centralized struc...

A Dynamic Multithreading Processor

by Haitham Akkary, et al.
Abstract - Cited by 190 (5 self)
We present an architecture that features dynamic multithreading execution of a single program. Threads are created automatically by hardware at procedure and loop boundaries and executed speculatively on a simultaneous multithreading pipeline. Data prediction is used to alleviate dependency constraints and enable lookahead execution of the threads. A two-level hierarchy significantly enlarges the instruction window. Efficient selective recovery from the second level instruction window takes place after a mispredicted input to a thread is corrected. The second level is slower to access but has the advantage of large storage capacity. We show several advantages of this architecture: (1) it minimizes the impact of ICache misses and branch mispredictions by fetching and dispatching instructions out-of-order, (2) it uses a novel value prediction and recovery mechanism to reduce artificial data dependencies created by the use of a stack to manage run-time storage, and (3) it improves the execution throughput of a superscalar by 15% without increasing the execution resources or cache bandwidth, and by 30% with one additional ICache fetch port. The speedup was measured on the integer SPEC95 benchmarks, without any compiler support, using a detailed performance simulator.

Dynamic Speculation and Synchronization of Data Dependences

by Andreas Moshovos, Scott E. Breach, T. N. Vijaykumar, Gurindar S. Sohi - In Proc. 24th International Symposium on Computer Architecture, 1997
Abstract - Cited by 185 (22 self)
Data dependence speculation is used in instruction-level parallel (ILP) processors to allow early execution of an instruction before a logically preceding instruction on which it may be data dependent. If the instruction is independent, data dependence speculation succeeds; if not, it fails, and the two instructions must be synchronized. The modern dynamically scheduled processors that use data dependence speculation do so blindly (i.e., every load instruction with unresolved dependences is speculated). In this paper, we demonstrate that as dynamic instruction windows get larger, significant performance benefits can result when intelligent decisions about data dependence speculation are made. We propose dynamic data dependence speculation techniques: (i) to predict if the execution of an instruction is likely to result in a data dependence mis-speculation, and (ii) to provide the synchronization needed to avoid a mis-speculation. Experimental results evaluating the effectiveness of the proposed techniques are presented within the context of a Multiscalar processor.

Citation Context

...ion 5, we provide experimental data on the dynamic behavior of memory dependences and present an evaluation of an implementation of the method we propose within the context of a Multiscalar processor [3,4,7,20]. Finally, in section 6 we list what, in our opinion, are the contributions of this work and offer concluding remarks. In the discussion that follows we are concerned with data dependence speculation;...
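
The contrast this abstract draws between blind speculation and predicted synchronization can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the mechanism evaluated in the paper: the class name `DependencePredictor`, the dictionary keyed by a (load PC, store PC) pair, and the fixed `threshold` are all hypothetical.

```python
class DependencePredictor:
    """Toy history-based predictor for load/store dependence speculation.

    Assumption: a load is speculated freely until it has mis-speculated
    against a given store at least `threshold` times; after that the pair
    is predicted dependent and the load should wait for the store.
    """

    def __init__(self, threshold=1):
        self.threshold = threshold
        # (load_pc, store_pc) -> number of observed mis-speculations
        self.misspeculations = {}

    def should_synchronize(self, load_pc, store_pc):
        """Return True if the load is predicted to depend on the store."""
        return self.misspeculations.get((load_pc, store_pc), 0) >= self.threshold

    def record_misspeculation(self, load_pc, store_pc):
        """Call when a speculated load turned out to depend on the store."""
        key = (load_pc, store_pc)
        self.misspeculations[key] = self.misspeculations.get(key, 0) + 1


predictor = DependencePredictor()
first_guess = predictor.should_synchronize(0x40, 0x20)    # no history yet
predictor.record_misspeculation(0x40, 0x20)               # speculation failed once
second_guess = predictor.should_synchronize(0x40, 0x20)   # now predicted dependent
other_pair = predictor.should_synchronize(0x44, 0x20)     # unrelated load, still speculated
```

Blind speculation corresponds to always returning `False` from `should_synchronize`; the sketch only shows how recorded history could change that decision for a specific pair.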

Trace processors

by Eric Rotenberg, Quinn Jacobson, Yiannakis Sazeides, Jim Smith - In Proceedings of the 30th International Symposium on Microarchitecture, 1997
Abstract - Cited by 176 (15 self)
Abstract not found

Citation Context

...n and hierarchy, expose ILP via aggressive speculation, or do both. For the most part, this body of research focuses on hardware-intensive approaches to ILP. Work in the area of multiscalar processors [8][9] first recognized the complexity of implementing wide instruction issue in the context of centralized resources. The result is an interesting combination of compiler and hardware. The compiler divi...

Compiler Optimization of Scalar Value Communication Between Speculative Threads

by Antonia Zhai, Chris Colohan, John Steffan, Todd Mowry - In Proceedings of the 10th ASPLOS, 2002
Abstract - Cited by 90 (18 self)
While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS, which is stalls due to forwarding scalar values between threads that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly-aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2–28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.

Citation Context

... TLS support include some form of DOACROSS synchronization, although few use the compiler to optimize this aspect of speculative execution. The most relevant related work is the Wisconsin Multiscalar [12, 28, 35] compiler, which performs synchronization and scheduling for register values [35]. (The Multiscalar effort also evaluated hardware support for automatically detecting and synchronizing data dependence...

Dynamic cluster assignment mechanisms

by Ramon Canal, Joan Manuel Parcerisa, Antonio González - In Proceedings of HPCA-6, 2000
Abstract - Cited by 77 (8 self)
Clustered microarchitectures are an effective approach to reducing the penalties caused by wire delays inside a chip. Current superscalar processors have in fact a two-cluster microarchitecture with a naive code partitioning approach: integer instructions are allocated to one cluster and floating-point instructions to the other. This partitioning scheme is simple and results in no communications between the two clusters (just through memory) but it is in general far from optimal because the workload is not evenly distributed most of the time. In fact, when the processor is running integer programs, the workload is extremely unbalanced since the FP cluster is not used at all. In this work we investigate run-time mechanisms that dynamically distribute the instructions of a program among these two clusters. By optimizing the trade-off between inter-cluster communication penalty and workload balance, the proposed schemes can achieve an average speed-up of 36% for the SpecInt95 benchmark suite.

Citation Context

...ime. Other authors have proposed clustered microarchitectures in which the partitioning scheme focuses on reducing the control dependence penalties. Examples of such architectures are the Multiscalar [8] [18], SPSM [4], Superthreaded [19], Trace Processors [16] [20], Pews [10], Speculative Multithreaded [11] and Dynamic Multithreaded [1]. In such architectures, each cluster executes a different threa...

Billion-Transistor Architectures

by Doug Burger, James R. Goodman, 1997
Abstract - Cited by 74 (5 self)
ns three articles, which appear in Cybersquare. Each describes one trend that will affect future microprocessor architectures. In the second category, each article makes the case for a different billion-transistor architecture. Although these articles represent the state of the art and the authors' best guesses, the future is notoriously hard to predict in our breakneck-paced field. Technology trends are generally easier to predict than their effects, but trend estimates can be wildly inaccurate. Intel's 1989 prediction for 1996 processors underestimated performance by a factor of four. Forecasting the effects of technology is even harder, as illustrated by several well-known quotes: • "Everything that can be invented has been invented." US Commissioner of Patents, 1899. • "I think there is a world market for about five computers." Thomas J. Watson Sr., IBM founder, 1943. • "There is no reason for any individuals to have a computer in their home.

Citation Context

...rol dependencies prevent a single processor from running very far ahead. The program execution could therefore benefit from having the processors function as stages in a Multiscalar-like architecture [12, 25], wherein each processor speculatively executes large blocks of code. The processor thereby obtains a much larger instruction window in which to find sufficient ILP. • Communication-bound: a single pr...
