Results 11 - 20 of 185
Instruction Issue Logic for Pipelined Supercomputers
- Proceedings of the 11th Annual Symposium on Computer Architecture
"... Basic principles and design tradeoffs for control of pipelined processors are first discussed. We concentrate on register-register architectures like the CRAY-1 where pipeline control logic is localized to one or two pipeline stages and is referred to as "ins ..."
Abstract
-
Cited by 66 (1 self)
- Add to MetaCart
Basic principles and design tradeoffs for control of pipelined processors are first discussed. We concentrate on register-register architectures like the CRAY-1 where pipeline control logic is localized to one or two pipeline stages and is referred to as "instruction issue logic". Design tradeoffs are explored by giving designs for a variety of instruction issue methods that represent a range of complexity and sophistication. These vary from the original CRAY-1 issue logic to a version of Tomasulo's algorithm, first used in the IBM 360/91 floating point unit. Also studied are Thornton's "scoreboard" algorithm used on the CDC 6600 and an algorithm we have devised. To provide a standard for comparison, all the issue methods are used to implement the CRAY-1 scalar architecture. Then, using a simulation model and the
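The simplest of the issue disciplines the abstract compares can be sketched in a few lines. The following is a toy, hypothetical model of a scoreboard-style issue check (register names and the exact stall rule are illustrative, not any machine's actual design): an instruction issues only if no in-flight instruction is still writing its destination or sources.

```python
# Toy sketch of a scoreboard-style issue check (illustrative only, not
# the CDC 6600's or CRAY-1's actual logic): track registers with an
# outstanding write and stall on RAW/WAW hazards.

class Scoreboard:
    def __init__(self):
        self.pending = set()  # registers with an in-flight write

    def can_issue(self, dest, srcs):
        if dest in self.pending:                      # WAW hazard
            return False
        if any(s in self.pending for s in srcs):      # RAW hazard
            return False
        return True

    def issue(self, dest, srcs):
        if not self.can_issue(dest, srcs):
            return False
        self.pending.add(dest)
        return True

    def complete(self, dest):
        self.pending.discard(dest)

sb = Scoreboard()
assert sb.issue("r1", ["r2", "r3"])   # no hazards: issues
assert not sb.issue("r4", ["r1"])     # RAW on r1: stalls
sb.complete("r1")
assert sb.issue("r4", ["r1"])         # issues once r1's write retires
```

More sophisticated schemes such as Tomasulo's algorithm remove the WAW stall entirely by renaming destinations, at the cost of extra tag-matching hardware.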
Solution of Partial Differential Equations on Vector Computers
- Proc. 1977 Army Numerical Analysis and Computers Conference
, 1977
"... In this paper we review the present status of numerical methods for partial differential equations on vector and parallel computers. A discussion of the relevant aspects of these computers and a brief review of their development is included, with particular attention paid to those characteristics t ..."
Abstract
-
Cited by 65 (0 self)
- Add to MetaCart
In this paper we review the present status of numerical methods for partial differential equations on vector and parallel computers. A discussion of the relevant aspects of these computers and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial-boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. A brief discussion of application areas utilizing these computers is included.
Analysis of Multithreaded Architectures for Parallel Computing
"... Multithreading has been proposed as an architectural strategy for tolerating latency in multiprocessors and, through limited empirical studies, shown to offer promise. This paper develops an analytical model of multithreaded processor behavior based on a small set of architectural and program parame ..."
Abstract
-
Cited by 64 (4 self)
- Add to MetaCart
Multithreading has been proposed as an architectural strategy for tolerating latency in multiprocessors and, through limited empirical studies, shown to offer promise. This paper develops an analytical model of multithreaded processor behavior based on a small set of architectural and program parameters. The model gives rise to a large Markov chain, which is solved to obtain a formula for processor efficiency in terms of the number of threads per processor, the remote reference rate, the latency, and the cost of switching between threads. It is shown that a multithreaded processor exhibits three operating regimes: linear (efficiency is proportional to the number of threads), transition, and saturation (efficiency depends only on the remote reference rate and switch cost). Formulae for regime boundaries are derived. The model is embellished to reflect cache degradation due to multithreading, using an analytical model of cache behavior, demonstrating that returns diminish as the number threads becomes large. Predictions from the embellished model correlate well with published empirical measurements. Prescriptive use of the model under various scenarios indicates that multithreading is effective, but the number of useful threads per processor is fairly small.
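The linear and saturation regimes described above have a simple closed form in the commonly used version of this model. The sketch below uses conventional notation (R = run length between remote references, L = remote latency, C = context-switch cost, N = threads), which may not match the paper's exact derivation; it is a plausibility sketch, not the paper's formula.

```python
# Hedged sketch of the two-regime efficiency model for a multithreaded
# processor. Notation is the common one (R, L, C, N), assumed rather
# than taken verbatim from the paper.

def efficiency(N, R, L, C):
    # Linear regime: too few threads to cover the latency L.
    linear = N * R / (R + L + C)
    # Saturation regime: latency fully hidden; only switch cost remains.
    saturation = R / (R + C)
    return min(linear, saturation)

# With R=20, L=100, C=2: one thread is mostly idle; eight saturate.
e1 = efficiency(1, 20, 100, 2)   # ~0.16
e8 = efficiency(8, 20, 100, 2)   # ~0.91, capped by R/(R+C)
```

The regime boundary falls where the two expressions cross, i.e. near N = (R + L + C)/(R + C) threads, which is why the useful number of threads per processor stays small when switch cost is low.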
Out-of-Order Vector Architectures
, 1997
"... Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace d ..."
Abstract
-
Cited by 59 (21 self)
- Add to MetaCart
Register renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory bandwidth is used more effectively. Using a trace driven simulation we compare a conventional vector implementation, based on the Convex C3400, with an out-of-order, register renaming, vector implementation. When the number of physical registers is above 12, out-of-order execution coupled with register renaming provides a speedup of 1.24--1.72 for realistic memory latencies. Out-of-order techniques also tolerate main memory latencies of 100 cycles with a performance degradation less than 6%. The mechanisms used for register renaming and out-of-order issue can be used to support precise interrupts -- generally a difficult problem in vector machines. When precise interrupts are implemented, there is typically less than a 10% degradation in performance. A new technique based on register renaming is targeted at dynamically eliminating spill code; this technique is shown to provide an extra speedup ranging between 1.10 and 1.20 while reducing total memory traffic by an average of 15--20%.
Efficient Conditional Operations for Data-parallel Architectures
- In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture
, 2000
"... Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data ..."
Abstract
-
Cited by 54 (9 self)
- Add to MetaCart
Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform poorly on these architectures. Conditional streams convert these constructs into data-dependent data movement. This allows data-parallel architectures to efficiently execute applications with data-dependent control flow. Essentially, conditional streams extend the range of applications that a data-parallel architecture can execute efficiently. For example, polygon rendering speeds up by a factor of 1.8 with the use of conditional streams.
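The core idea, converting data-dependent control into data-dependent data movement, can be sketched in scalar code. The function below (names are ours, not the paper's API) routes each element into one of two dense streams by a predicate, runs a uniform branch-free kernel over each stream, and scatters results back in order.

```python
# Illustrative sketch of conditional streams: partition by predicate,
# run each kernel over a dense stream, merge results back in order.
# This shows the data-movement idea only; a real data-parallel machine
# would do the routing in hardware across SIMD lanes.

def conditional_streams(xs, pred, then_kernel, else_kernel):
    taken = [(i, x) for i, x in enumerate(xs) if pred(x)]
    not_taken = [(i, x) for i, x in enumerate(xs) if not pred(x)]
    out = [None] * len(xs)
    for i, x in taken:        # kernel runs branch-free over its stream
        out[i] = then_kernel(x)
    for i, x in not_taken:
        out[i] = else_kernel(x)
    return out

result = conditional_streams([3, -1, 4, -5], lambda x: x >= 0,
                             lambda x: x * 2, lambda x: -x)
# result == [6, 1, 8, 5]
```

Because every element of a given stream takes the same path, the SIMD lanes stay fully utilized regardless of how the predicate splits the input.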
Compiling for the Multiscalar Architecture
, 1998
"... High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of real-world software that run on the computers. State-of-the-art microprocessors exploit instruction level para ..."
Abstract
-
Cited by 53 (2 self)
- Add to MetaCart
High-performance, general-purpose microprocessors serve as compute engines for computers ranging from personal computers to supercomputers. Sequential programs constitute a major portion of real-world software that run on the computers. State-of-the-art microprocessors exploit instruction level parallelism (ILP) to achieve high performance on such applications by searching for independent instructions in a dynamic window of instructions and executing them on a wide-issue pipeline. Increasing the window size and the issue width to extract more ILP may hinder achieving high clock speeds, limiting overall performance. The Multiscalar architecture employs multiple small windows and many narrow-issue processing units to exploit ILP at high clock speeds. Sequential programs are partitioned into code fragments called tasks, which are speculatively executed in parallel. Inter-task register dependences are honored via communication and synchronization and inter-task control flow and memory depe...
Dynamic warp subdivision for integrated branch and memory divergence tolerance: Extended results.
, 2010
"... ABSTRACT SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) are stalled due to long l ..."
Abstract
-
Cited by 50 (6 self)
- Add to MetaCart
ABSTRACT SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) are stalled due to long latency memory accesses. The resulting idle cycles are extremely costly. Multi-threading can hide latencies by interleaving the execution of multiple warps, but deep multi-threading using many warps dramatically increases the cost of the register files (multi-threading depth × SIMD width), and cache contention can make performance worse. Instead, intra-warp latency hiding should first be exploited. This allows threads that are ready but stalled by SIMD restrictions to use these idle cycles and reduces the need for multi-threading among warps. This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space. Independent scheduling entities allow divergent branch paths to interleave their execution, and allow threads that hit to run ahead. The result is improved latency hiding and memory level parallelism (MLP). We evaluate the technique on a coherent cache hierarchy with private L1 caches and a shared L2 cache. With an area overhead of less than 1%, experiments with eight data-parallel benchmarks show our technique improves performance on average by 1.7X.
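The subdivision step itself can be illustrated with a toy partition (ours, not the paper's mechanism): threads whose loads hit in cache form one warp-split that keeps running, while threads that missed form another that waits on memory, each occupying its own scheduler slot but the same register file space.

```python
# Toy sketch of dynamic warp subdivision: split a warp's threads by
# memory outcome into independently schedulable warp-splits. The hit()
# predicate and list representation are illustrative assumptions.

def subdivide(warp_threads, hit):
    """Partition a warp by memory outcome; each non-empty part gets
    its own scheduler slot, sharing the original register space."""
    run_ahead = [t for t in warp_threads if hit(t)]
    stalled = [t for t in warp_threads if not hit(t)]
    return [s for s in (run_ahead, stalled) if s]

splits = subdivide([0, 1, 2, 3], lambda t: t % 2 == 0)
# splits == [[0, 2], [1, 3]]: even lanes run ahead, odd lanes wait
```

The same partitioning applies to divergent branches: each path becomes a split that interleaves with the other instead of serializing behind it.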
Sentinel scheduling: a model for compiler-controlled speculative execution
- ACM Transactions on Computer Systems
, 1993
"... Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to efficiently handle exceptions for speculative instructions. In this paper, a set of architectural features and compile-time schedulin ..."
Abstract
-
Cited by 47 (14 self)
- Add to MetaCart
Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to efficiently handle exceptions for speculative instructions. In this paper, a set of architectural features and compile-time scheduling support collectively referred to as sentinel scheduling is introduced. Sentinel scheduling provides an effective framework for both compiler-controlled speculative execution and exception handling. All program exceptions are accurately detected and reported in a timely manner with sentinel scheduling. Recovery from exceptions is also ensured with the model. Experimental results show the effectiveness of sentinel scheduling for exploiting instruction-level parallelism and the overhead associated with exception handling. Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles - associative memories; C.0 [Computer Systems Organization]: General - hardware/software interfaces; instruction set design; system architectures; C.1.2 [Processor Architectures]: Single Data Stream Architectures - pipeline processors; D.2.4 [Software Engineering]: Testing and Debugging -
Evaluating the Effects of Predicated Execution on Branch Prediction
- in Proceedings of the 27th International Symposium on Microarchitecture
, 1994
"... High performance architectures have always had to deal with the performance-limiting impact of branch operations. Microprocessor designs are going to have to deal with this problem as well, as they move towards deeper pipelines and support for multiple instruction issue. Branch prediction schemes ar ..."
Abstract
-
Cited by 43 (3 self)
- Add to MetaCart
High performance architectures have always had to deal with the performance-limiting impact of branch operations. Microprocessor designs are going to have to deal with this problem as well, as they move towards deeper pipelines and support for multiple instruction issue. Branch prediction schemes are often used to alleviate the negative impact of branch operations by allowing the speculative execution of instructions after an unresolved branch. Another technique is to eliminate branch instructions altogether. Predication can remove forward branch instructions by translating the instructions following the branch into predicate form. This paper analyzes a variety of existing predication models for eliminating branch operations, and the effect that this elimination has on the branch prediction schemes in existing processors, including single issue architectures with simple prediction mechanisms, to the newer multi-issue designs with correspondingly more sophisticated branch predictors. T...
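If-conversion, the transformation the abstract analyzes, can be shown in miniature. The sketch below is a generic illustration (not any paper's specific predication model): the branchy version exposes a forward branch to the predictor, while the predicated version computes both sides and selects by a predicate.

```python
# Minimal sketch of if-conversion: a forward branch is replaced by
# computing both paths under a predicate and selecting the result.
# In real predicated ISAs the select is a hardware operation; Python's
# conditional expression stands in for it here purely for illustration.

def branchy(a, b, c):
    if c:                 # conditional branch the predictor must guess
        return a + 1
    return b - 1

def predicated(a, b, c):
    p = bool(c)
    t = a + 1             # "then" path, result discarded if p is false
    f = b - 1             # "else" path, result discarded if p is true
    return t if p else f  # predicate-driven select

for a, b, c in [(1, 2, True), (1, 2, False)]:
    assert branchy(a, b, c) == predicated(a, b, c)
```

The trade-off the paper studies follows directly: the predicated form executes both paths' instructions, but the removed branch no longer consumes predictor state or risks a misprediction.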
Data relocation and prefetching for programs with large data sets
- Department of Computer Science, University of Illinois
, 1995
"... Numerical applications frequently contain nested loop structures that process large arrays of data. The execution of these loop structures often produces memory reference patterns that poorly utilize data caches. Limited associativity and cache capacity result in cache conflict misses. Also, non-uni ..."
Abstract
-
Cited by 37 (1 self)
- Add to MetaCart
Numerical applications frequently contain nested loop structures that process large arrays of data. The execution of these loop structures often produces memory reference patterns that poorly utilize data caches. Limited associativity and cache capacity result in cache conflict misses. Also, non-unit stride access patterns can cause low utilization of cache lines. Data copying has been proposed and investigated in order to reduce the cache conflict misses [1][2], but this technique has a high execution overhead since it does the copy operations entirely in software. We propose a combined hardware and software technique called data relocation and prefetching which eliminates much of the overhead of data copying through the use of special hardware. Furthermore, by relocating the data while performing software prefetching, the overhead of copying the data can be reduced further. Experimental results for data relocation and prefetching are encouraging and show a large improvement in cache performance. Index terms - Cache conflicts, data copying, data relocation, program optimization, software prefetching.
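The software data-copying baseline that the paper's hardware scheme accelerates can be sketched simply: a strided column of a row-major array is copied once into a contiguous buffer, so all subsequent reuse is unit-stride. The function names are ours, and this shows only the software baseline, not the paper's hardware relocation.

```python
# Illustrative sketch of software data copying: one pass of strided
# reads relocates a column into a contiguous buffer, after which the
# working set maps to consecutive cache lines instead of conflicting
# ones. (The paper moves this copy into hardware and overlaps it with
# prefetching; this is only the all-software baseline it improves on.)

def copy_column(matrix, col):
    return [row[col] for row in matrix]   # the relocation (copy) step

def column_sum(matrix, col):
    buf = copy_column(matrix, col)        # pay the copy overhead once
    return sum(buf)                       # unit-stride reuse of the copy

m = [[1, 2], [3, 4], [5, 6]]
assert column_sum(m, 1) == 12
```

The overhead the paper targets is exactly the `copy_column` pass: done in software it costs a full strided traversal, which is why performing it in hardware alongside prefetching pays off when the copied data is reused.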