Wattch: A Framework for Architectural-Level Power Analysis and Optimizations
In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000
"... Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high ..."
Cited by 1320 (43 self)
Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and even compiler writers, in addition to circuit designers. Most existing power analysis tools achieve high accuracy by calculating power estimates for designs only after layout or floorplanning are complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities.
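The kind of estimate such an architectural-level tool produces can be illustrated with a small activity-based model: per-unit dynamic power is approximated as alpha * C * Vdd^2 * f, with the activity factor alpha taken from simulated access counts. The sketch below is a generic illustration of that idea, not Wattch itself; all unit names, capacitances, and counts are invented.

```python
# Minimal sketch of activity-based architectural power estimation in the
# spirit of tools like Wattch: per-unit dynamic power is modeled as
# alpha * C * Vdd^2 * f, where alpha is an activity factor derived from
# simulated access counts. All unit names, capacitances, and counts are
# invented illustrative values, not figures from the paper.

UNITS = {
    # unit: (effective switched capacitance in farads, simulated accesses)
    "icache":  (1.2e-9, 900_000),
    "dcache":  (1.5e-9, 400_000),
    "regfile": (0.8e-9, 1_800_000),
    "alu":     (0.5e-9, 1_100_000),
}

VDD = 1.0           # supply voltage (V), assumed
FREQ = 2.0e9        # clock frequency (Hz), assumed
CYCLES = 1_000_000  # simulated cycles, assumed


def dynamic_power(units, vdd, freq, cycles):
    """Sum alpha * C * Vdd^2 * f over all modeled units."""
    total = 0.0
    for name, (cap, accesses) in units.items():
        alpha = accesses / cycles      # per-cycle activity factor
        power = alpha * cap * vdd ** 2 * freq
        print(f"{name:8s} {power:5.2f} W")
        total += power
    return total


print(f"total    {dynamic_power(UNITS, VDD, FREQ, CYCLES):5.2f} W")
```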
Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures
2000
"... The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as se ..."
Cited by 324 (23 self)
The doubling of microprocessor performance every three years has been the result of two factors: more transistors per chip and superlinear scaling of the processor clock with technology generation. Our results show that, due to both diminishing improvements in clock rates and poor wire scaling as semiconductor devices shrink, the achievable performance growth of conventional microarchitectures will slow substantially. In this paper, we describe technology-driven models for wire capacitance, wire delay, and microarchitectural component delay. Using the results of these models, we measure the simulated performance---estimating both clock rate and IPC---of an aggressive out-of-order microarchitecture as it is scaled from a 250nm technology to a 35nm technology. We perform this analysis for three clock scaling targets and two microarchitecture scaling strategies: pipeline scaling and capacity scaling. We find that no scaling strategy permits annual performance improvements of better than 12.5%, which is far worse than the annual 50-60% to which we have grown accustomed.
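As a quick sanity check on the headline numbers, compounding the two growth rates over a decade shows how large the gap is; this is plain arithmetic, not the paper's model.

```python
# Compounding the abstract's growth rates over a decade; plain
# arithmetic, not the paper's performance model.

def compound(rate, years):
    return (1.0 + rate) ** years

YEARS = 10
for rate in (0.125, 0.50, 0.60):
    print(f"{rate:.1%} per year -> {compound(rate, YEARS):6.1f}x after {YEARS} years")
# 12.5% per year yields roughly 3.2x in a decade, versus about 58x-110x
# at the historical 50-60% per year.
```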
Clustered Speculative Multithreaded Processors
1999
"... In this paper we present a processor microarchitecture that can simultaneously execute multiple threads and has a clustered design for scalability purposes. A main feature of the proposed microarchitecture is its capability to spawn speculative threads from a single-thread application at run-time. T ..."
Cited by 180 (10 self)
In this paper we present a processor microarchitecture that can simultaneously execute multiple threads and has a clustered design for scalability purposes. A main feature of the proposed microarchitecture is its capability to spawn speculative threads from a single-thread application at run-time. These speculative threads use otherwise idle resources of the machine. Spawning a speculative thread involves predicting its control flow as well as its dependences with other threads and the values that flow through them. In this way, threads that are not independent can be executed in parallel. Control-flow, data value and data dependence predictors particularly designed for this type of microarchitecture are presented. Results show the potential of the microarchitecture to exploit speculative parallelism in programs that are hard to parallelize at compile-time, such as the SpecInt95. For a 4-thread unit configuration, some programs such as ijpeg and li can exploit an average degree of parallelism of more than 2 threads per cycle. The average degree of parallelism for the whole SpecInt95 suite is 1.6 threads per cycle. This speculative parallelism results in significant speedups for all the SpecInt95 programs when compared with a single-thread execution.
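One ingredient of such a design is predicting the values that flow into a speculative thread. The sketch below shows a bare last-value predictor for live-in registers as a generic illustration; it is not the predictor proposed in the paper, and the spawn point and register names are invented.

```python
# Illustrative sketch of one ingredient of speculative multithreading:
# a last-value predictor for the register values that flow into a newly
# spawned thread. This is a generic stand-in, not the predictors
# proposed in the paper; the spawn point and register names are invented.

class LastValuePredictor:
    def __init__(self):
        self.table = {}  # (spawn_pc, reg) -> last observed live-in value

    def predict(self, spawn_pc, reg):
        return self.table.get((spawn_pc, reg))

    def train(self, spawn_pc, reg, actual):
        self.table[(spawn_pc, reg)] = actual


pred = LastValuePredictor()
pred.train(spawn_pc=0x400, reg="r3", actual=128)

# A thread spawned later at the same point gets its live-in r3 guessed
# from the last dynamic instance; a mismatch detected at verification
# time would squash and re-execute the speculative thread.
print("predicted live-in r3:", pred.predict(0x400, "r3"))
```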
Dependence Based Prefetching for Linked Data Structures
1998
"... We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By iden ..."
Cited by 179 (13 self)
We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identifying producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal. To achieve a prefetching effect, a small prefetch engine speculatively traverses this representation ahead of the executing program. Dependence-based prefetching achieves speedups of up to 25% on a suite of pointer-intensive programs.

1 Introduction
Linked data structures (LDS) such as lists and trees are used in many important applications. The importance of LDS is growing with the increasing popularity of C++, Java, and other systems that use linked object graphs and function tables. Flexible, dynamic construction allows linked structures to grow large and diffic...
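The producer-consumer idea lends itself to a compact software model: learn that the value loaded from one field (here, a node's next pointer) is consumed as the address of the following traversal load, then let a small engine chase that dependence a few nodes ahead. The sketch below is only illustrative; the structures and the fixed look-ahead depth are assumptions, not the paper's hardware.

```python
# Software model of the producer-consumer idea: the value loaded from a
# node's next pointer is consumed as the address of the next traversal
# load, so a small engine can chase that dependence a few nodes ahead of
# the program. Structures and the fixed look-ahead depth are assumptions
# for illustration, not the paper's hardware design.

class Node:
    def __init__(self, payload, nxt=None):
        self.payload = payload
        self.next = nxt

# Small linked list standing in for program data.
head = None
for v in range(10, 0, -1):
    head = Node(v, head)

class DependencePrefetcher:
    def __init__(self, depth=3):
        self.depth = depth  # how far ahead of the program to run

    def prefetch(self, node):
        # Speculatively follow the learned dependence (node -> node.next);
        # in hardware each step would issue a cache prefetch.
        out, cur = [], node
        for _ in range(self.depth):
            cur = cur.next if cur else None
            if cur is None:
                break
            out.append(cur.payload)
        return out

engine = DependencePrefetcher(depth=3)
cur = head
while cur is not None:
    print("visit", cur.payload, "prefetch", engine.prefetch(cur))
    cur = cur.next
```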
Trace processors
In Proceedings of the 30th International Symposium on Microarchitecture, 1997
"... ..."
(Show Context)
The Multicluster Architecture: Reducing Cycle Time Through Partitioning
1997
"... The multicluster architecture that we introduce offers a decentralized, dynamically-scheduled architecture, in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural regis ..."
Cited by 176 (0 self)
The multicluster architecture that we introduce offers a decentralized, dynamically-scheduled architecture, in which the register files, dispatch queue, and functional units of the architecture are distributed across multiple clusters, and each cluster is assigned a subset of the architectural registers. The motivation for the multicluster architecture is to reduce the clock cycle time, relative to a single-cluster architecture with the same number of hardware resources, by reducing the size and complexity of components on critical timing paths. Resource partitioning, however, introduces instruction-execution overhead and may reduce the number of concurrently executing instructions. To counter these two negative by-products of partitioning, we developed a static instruction scheduling algorithm. We describe this algorithm, and using trace-driven simulations of SPEC92 benchmarks, evaluate its effectiveness. This evaluation indicates that for the configurations considered, the multicluste...
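For illustration, a static cluster-assignment pass might place each instruction on the cluster that already produces most of its source registers, breaking ties toward the least-loaded cluster to limit both inter-cluster copies and load imbalance. This is a generic heuristic sketch, not the scheduling algorithm evaluated in the paper.

```python
# Generic heuristic sketch (not the paper's scheduling algorithm): place
# each instruction on the cluster that already produces most of its
# source registers, breaking ties toward the least-loaded cluster, so
# that inter-cluster copies and load imbalance are both kept in check.

def assign_clusters(instrs, n_clusters=2):
    reg_home = {}              # register -> cluster that produces it
    load = [0] * n_clusters    # instructions placed on each cluster
    placement = []
    for dst, srcs in instrs:   # each instruction: (dest reg, source regs)
        votes = [0] * n_clusters
        for s in srcs:
            if s in reg_home:
                votes[reg_home[s]] += 1
        best = max(range(n_clusters), key=lambda c: (votes[c], -load[c]))
        placement.append(best)
        load[best] += 1
        reg_home[dst] = best
    return placement

prog = [("r1", []), ("r2", ["r1"]), ("r3", ["r1", "r2"]), ("r4", [])]
print(assign_clusters(prog))   # e.g. [0, 0, 0, 1]
```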
A Chip-Multiprocessor Architecture with Speculative Multithreading.
IEEE Trans. on Computers, 1999
"... ..."
(Show Context)
Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading
ACM Transactions on Computer Systems, 1997
"... This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is i ..."
Cited by 147 (17 self)
This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting ...

This research was supported by Digital Equipment Corporation, the Washington Technology Center, NSF PYI Award MIP-9058439, NSF grants MIP-9632977, CCR-9200832, and CCR-9632769, DARPA grant F30602-97-2-0226, ONR grants N00014-92-J-1395 and N00014-94-11136, and fellowships from Intel and the Computer Measurement Group.
Multiple-banked register file architectures
In International Symposium on Computer Architecture (ISCA-27), 2000
"... Abstract The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the ..."
Cited by 146 (12 self)
The register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage register file has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a register file architecture composed of multiple banks. In particular we focus on a multi-level organization of the register file, which provides low latency and simple bypass logic. We propose several caching policies and prefetching strategies and demonstrate the potential of this multiple-banked organization. For instance, we show that a two-level organization degrades IPC by 10% and 2% with respect to a non-pipelined single-banked register file, for SpecInt95 and SpecFP95 respectively, but it increases performance by 87% and 92% when the register file access time is factored in.

Keywords: Register file architecture, dynamically-scheduled processor, bypass logic, register file cache.

Introduction
Most current dynamically scheduled microprocessors have a RISC-like instruction set architecture, and therefore, the majority of instruction operands reside in the register file. The access time of the register file basically depends on both the number of registers and the number of ports. However, wider issue machines require more ports in the register file, which may significantly increase its access time. Current trends in microprocessor design and technology lead to projections that the access time of a monolithic register file will be significantly higher than that of other common operations, such as integer additions. Under this scenario, a pipelined register file is critical to high performance; otherwise, the processor cycle time would be determined by the register file access time. However, pipelining a register file is not trivial. Moreover, a multi-cycle pipelined register file still causes a performance degradation in comparison with a single-cycle register file, since a multi-cycle register file increases the branch misprediction penalty. Besides, a multi-cycle register file either requires multiple levels of bypassing, which is one of the most time-critical components in current microprocessors, or processor performance will be significantly affected if only a single level of bypassing is included. In this paper we propose a register file architecture that can achieve an IPC rate (instructions per cycle) much higher than a multi-cycle file and close to a single-cycle file, but at the same time it requires just a single level of bypass. The key idea is to have multiple register banks with a heterogeneous architecture, such that banks differ in number of registers, number of ports and thus, access time. We propose a run-time mechanism to allocate values to registers which aims to keep the most critical values in the fast banks, whereas the remaining values are held in slower banks.

We show that the proposed organization degrades IPC by 10% and 2% with respect to a one-cycle single-banked register file for SpecInt95 and SpecFP95 respectively, assuming an infinite number of ports. However, when the register file cycle time is factored in and the best configuration in terms of instruction throughput (instructions per time unit) is chosen for each register file architecture, the proposed architecture outperforms the single-banked register file by 87% and 92% respectively. When compared ...
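A minimal software analogue of the two-level organization is a small fast bank backed by a complete slow bank, with some policy deciding which values stay in the fast bank. The sketch below uses a simple most-recently-written policy purely as a stand-in for the caching and prefetching policies studied in the paper; sizes and latencies are invented.

```python
# Rough sketch of a two-level register file: a small fast bank backed by
# a complete slow bank. The "keep the most recently written values in
# the fast bank" policy is only a stand-in for the caching/prefetching
# policies studied in the paper; sizes and latencies are invented.

from collections import OrderedDict

class TwoLevelRegisterFile:
    def __init__(self, fast_entries=16):
        self.fast = OrderedDict()   # reg -> value, ordered by last write
        self.slow = {}              # backing bank, holds every register
        self.fast_entries = fast_entries

    def write(self, reg, value):
        self.slow[reg] = value
        self.fast[reg] = value
        self.fast.move_to_end(reg)
        if len(self.fast) > self.fast_entries:
            self.fast.popitem(last=False)   # evict least recently written

    def read(self, reg):
        if reg in self.fast:
            return self.fast[reg], 1        # fast-bank hit: 1 cycle
        return self.slow[reg], 2            # slow bank adds a cycle

rf = TwoLevelRegisterFile(fast_entries=2)
rf.write("p1", 10); rf.write("p2", 20); rf.write("p3", 30)
print(rf.read("p3"))   # (30, 1): still in the fast bank
print(rf.read("p1"))   # (10, 2): served from the slow bank
```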
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002
"... Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 bench-marks, we find that for a high-performance architecture imple-mented in l ..."
Cited by 120 (14 self)
Microprocessor clock frequency has improved by nearly 40% annually over the past decade. This improvement has been provided, in equal measure, by smaller technologies and deeper pipelines. From our study of the SPEC 2000 benchmarks, we find that for a high-performance architecture implemented in 100nm technology, the optimal clock period is approximately 8 fan-out-of-four (FO4) inverter delays for integer benchmarks, comprised of 6 FO4 of useful work and an overhead of about 2 FO4. The optimal clock period for floating-point benchmarks is 6 FO4. We find these optimal points to be insensitive to latch and clock skew overheads. Our study indicates that further pipelining can at best improve performance of integer programs by a factor of 2 over current designs. At these high clock frequencies it will be difficult to design the instruction issue window to operate in a single cycle. Consequently, we propose and evaluate a high-frequency design called a segmented instruction window.
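The trade-off behind such an optimum can be captured in a toy analytical model: shorter stages shrink the clock period toward the per-stage overhead, but the deeper pipeline loses more cycles to hazards whose penalty scales with depth. The constants below are assumptions chosen only to make the shape of the curve visible; this is not the paper's simulation methodology.

```python
# Toy analytical model (my own simplification, not the paper's
# simulation methodology): shorter stages raise clock frequency, but the
# deeper pipeline loses more cycles to hazards whose penalty scales with
# depth. All constants are assumptions chosen for illustration.

TOTAL_LOGIC_FO4 = 180   # assumed useful logic per instruction path (FO4)
OVERHEAD_FO4 = 2        # latch + skew overhead per stage (~2 FO4)
HAZARD_EVENTS = 0.1     # assumed depth-scaled stall events per instruction
BASE_CPI = 1.0

def time_per_instruction(useful_fo4):
    stages = TOTAL_LOGIC_FO4 / useful_fo4
    period = useful_fo4 + OVERHEAD_FO4          # clock period in FO4
    cpi = BASE_CPI + HAZARD_EVENTS * stages     # penalty grows with depth
    return period * cpi                         # time in FO4 units

best = min(range(2, 31), key=time_per_instruction)
print("optimal useful work per stage:", best, "FO4")
print("optimal clock period:", best + OVERHEAD_FO4, "FO4")
```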