A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative evaluation of latency reducing and tolerating techniques," Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 254-263, 1991.
....SMT under a variety of application-level workloads. Some workloads examined include SPEC (92 and 95) [82, 81], SPLASH-2 [44], MPEG-2 decompression [68], and a database workload [43]. Evaluations of other multithreading [52] and CMP architectures have similarly been limited to application code only [3, 35, 15, 2, 71, 41, 30] or PALcode [91]. Our study is the first to measure operating system behavior on a simultaneous multithreading architecture. SMT differs significantly from previous architectures with respect to operating system execution, because kernel instructions from multiple threads can execute ....
GUPTA, A., HENNESSY, J., GHARACHORLOO, K., MOWRY, T., AND WEBER, W. Comparative evaluation of latency reducing and tolerating techniques. In Proceedings of the International Symposium on Computer Architecture (May 1991).
....which assumes the use of fine-grain threads in the application, can suspend a kernel process between the execution of two threads. Their experiments show that having one process per processor results in significant performance improvement when compared to a time-slicing policy. Subsequent work by Gupta et al. [1991b] investigated the effects of different scheduling policies and synchronization primitives on a UMA multiprocessor using simulation. They showed that in the presence of multiprogramming, blocking primitives always outperform spinning primitives. They also showed that coscheduling and ....
....is to recognize the dominant role of communication in current systems, and to adopt techniques for reducing communication in parallel programs. Cache architecture sensitive parallel application restructuring (CASPAR) [Cheriton et al. 1991], latency tolerant techniques [Agarwal et al. 1990; Gupta et al. 1991a], and the scheduling schemes discussed here are all steps in the right direction. These techniques will be even more important in the future if shared memory machines are to be used efficiently for parallel programming. Acknowledgements The authors would like to thank Prakash Das, Mark Crovella ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques," In Proceedings of the 18th International Symposium on Computer Architecture, pages 254-263, May 1991.
....multiprocessor will be expensive. The incremental cost of sending an additional word in a message is A. This parameter is equal to 10 in our simulations. These parameter values are representative of the costs we would expect to incur in a scalable multiprocessor. As in the DASH multiprocessor [12], a cache fill from a remote node can be expensive. For example, it costs about 100 cache cycles to satisfy a cache miss from remote memory, including 40 cycles to send a request for data to another node, 50 cycles or more to send the reply, and 5 cycles to retrieve the data from memory. We ....
....fraction of writes to shared data is one of the most important parameters affecting the performance of scalable caches. It is interesting, therefore, to understand how this fraction varies across applications. The fractions of writes to shared data for the 3 benchmarks on 16 processors discussed in [12] are 31% (MP3D), 33% (LU), and 11% (PTHOR). In the applications discussed in [13], shared writes constituted 17% (Maxflow), 5.6% (SA TSP), 19% (MP3D), 6.8% (PTHOR), and 5% (LocusRoute) of all shared references. In [9] the figures are 7% (PLOVER), 22% (PSPICE), 10% (PUPPY), and 2% (TOPOPT). For ....
A. Gupta, J. Hennessy, K. Gharachorloo, Todd Mowry, and W.D. Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proc. 18th International Symposium on Computer Architecture, pages 254-263, May 1991.
....[10, 5, 1, 9]. In software prefetching, it is the programmer or compiler who is responsible for deciding when and what is going to be brought to the cache or to a register. Most research on software prefetching has been devoted to regular access patterns such as those found in numerical applications [7, 2, 11, 8], but lately there has also been research that tries to detect and prefetch recursive data structures [13, 17], which appear in non-numerical applications. Software prefetching can be classified as non-binding or binding, depending on whether the data is brought to L1 or to the register file. ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W. Weber. Comparative evaluation of latency reducing and tolerating techniques. 18th Annual International Symposium on Computer Architecture, May 1991.
....our current study achieves clustering through unroll and jam and shows how this technique can actually improve prefetching. Finally, this work has focused on software latency tolerance techniques. Hardware techniques such as hardware prefetching or multithreading also provide latency tolerance [9, 11, 14, 31]. The interaction of read miss clustering with such hardware techniques remains an open question. 8 Conclusions and Future Work This work compares and combines two latency hiding techniques, read miss clustering and software prefetching. For the applications and systems we study, clustering ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In Proc. of the 18th Annual Int'l Symp. on Computer Architecture, pages 254--263, May 1991.
.... with the DASH [11] multiprocessor shows 20-28% utilization with a 66-80% hit ratio with a release consistency memory model [64]. Multithreading has been introduced to tolerate the network latency by overlapping remote memory access of one thread with the computation of other threads [61] [65]. Analysis of the multithreaded processor has been studied in [66], [67]. However, we believe that it would be preferable to reduce the long latency rather than to tolerate it as the variance of the communication latency becomes large [68]. We plan to employ a multitasking scheme as a complementary ....
A. Gupta, J. Hennessy, et al., "Comparative Evaluation of Latency Reducing and Tolerating Techniques," Proc. Int. Symp. Comput. Arch., pp. 254-263, 1991.
....such as the baseline protocol, the performance is often limited by processor stall times resulting from memory access latencies. To address this problem, techniques to either reduce or tolerate these latencies have been proposed and evaluated in the context of hardware based directory protocols [1, 7, 8, 12, 15, 22, 25]. One of the objectives in VII is to evaluate the effectiveness of such techniques in the context of software only directory protocols. For software only directory protocols, the total execution time is prolonged not only by processor stall times, but also by protocol execution overhead resulting ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W-D. Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques", In Proceedings of the 18th International Symposium on Computer Architecture, pages 254-263, May 1991.
....performance is often limited by processor stall times resulting from memory access latencies. To reduce these stall times, and thus increase the performance, several latency tolerating and reducing techniques have been proposed and evaluated in the context of hardware only directory protocols [6, 11, 14, 17]. In software only directory protocols, the invocation of software handlers in can also prolong the execution time in two ways. First, the handler latency may end up on the memory access path and thus increase the memory latency seen by the requesting processor. In [7] we proposed strategies to ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W-D. Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques," Proc. 18th Int'l Symp. on Computer Architecture, pp. 254-263, May 1991.
....performance is often limited by processor stall times resulting from memory access latencies. To reduce processor stall times, and thus increase performance, several latency-tolerating and reducing techniques have been proposed and evaluated in the context of hardware only directory protocols [8, 10, 13, 19, 31, 36]. In addition to the processor stall times, the invocation of software handlers on the compute processor in software only directory protocols can prolong the execution time. While the handler latency might end up on the memory access path and thus increase the memory access latency seen by the ....
....well in comparison with hardware-centric implementations. Multithreading is a latency tolerating technique used in, e.g., the MIT Alewife [1]. It has been shown to be effective in hiding processor stall times for accesses that require global actions by switching to another thread of computation [19]. However, since multiple threads run on the same processor, the number of global actions originating from each processor is likely to increase. As a result, the protocol execution overhead in a software only directory protocol is expected to increase. Therefore, building on our derived framework ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative evaluation of latency reducing and tolerating techniques," in Proc. 18th Int'l Symp. Computer Architecture, pp. 254-263, May 1991.
....is often limited by processor stall times resulting from memory access latencies. To reduce the processor stall times, and thus increase the performance, several latency tolerating and reducing techniques have been proposed and evaluated in the context of hardware only directory protocols [8, 10, 12, 16, 22, 26]. In addition to the processor stall times, the invocation of software handlers on the compute processor in software only directory protocols can prolong the execution time in two ways. First, the handler latency might end up on the memory access path and thus delay the memory access latency seen ....
....the execution times of HW and HW RC, and between the execution times of SW and SW RC in Figure 7 shows that release consistency gives a consistent performance improvement for both hardware only and software only directory protocols. This is consistent with results presented in earlier studies [13, 16]. Table 4: Execution time ratios between software only and hardware only directory protocols with and without the migratory optimization (columns: Water, LU, Ocean, MP3D). ETR = E_sw(SW) / E_hw(HW): 1.16, 1.37, 1.43, 1.76. ETR = E_sw(SW M) / E_hw(HW M): 1.11, 1.36, 1.53, 1.46. By comparing the protocol ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W-D. Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques," In Proceedings of the 18th International Symposium on Computer Architecture, pages 254-263, May 1991.
....SMT under a variety of application-level workloads. Some workloads examined include SPEC (92 and 95) [42, 41], SPLASH-2 [22], MPEG-2 decompression [35], and a database workload [21]. Evaluations of other multithreading and CMP architectures have similarly been limited to application code only [3, 18, 6, 2, 37, 20, 14] or PALcode [47]. Our study is the first to measure operating system behavior on a [table; columns: Metric, SMT (Apache only, Apache OS, Change), Superscalar (Apache only, Apache OS, Change). Branch misprediction rate (%): 4.4, 9.1, 2.1x; 3.3, 7.4, 2.2x. BTB misprediction rate (%): 36.7, 59.6, 62; 31.1, 55.3, 77. L1 Icache miss ....]
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W. Weber. Comparative evaluation of latency reducing and tolerating techniques. In 18th Annual International Symposium on Computer Architecture, May 1991.
....and evaluates the memory performance of scientific codes. Most of this research focuses on: a) hiding or tolerating memory latency, b) decreasing the number of cache misses incurred, or c) avoiding bank conflicts in an interleaved memory system. Nonblocking caches and prefetching to cache [Bae91, Cal91, Dah94, Gup91, Kla91, Mow92, Soh91], prefetching to registers (as in the IBM 3033 [Kog81] or as proposed by Fu, Patel, and Janssens [FuP92]), or prefetching to special preload buffers [FuP91] can be used to overlap memory accesses with computation, or to overlap the latencies of more than one access. ....
....that occur on different processors. The fewer assurances the system makes with respect to the order of events, the greater the potential overlap of operations within the same processor and among different processors [Lil93]. Exploiting this potential concurrency can increase system performance [Gha91, Gup91, Tor90, Zuc92]. The sequential consistency model requires that all memory operations are executed in the order defined by the program, and that each access to the shared memory must complete before the next shared memory access can begin [Lil93]. In other words, the execution of the parallel program must ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques", Proceedings of the 18th Annual International Symposium on Computer Architecture (ISCA), published as ACM SIGARCH Computer Architecture News, pages 254-263, May 1991.
....remove the long latency operations completely. To address the performance loss associated with remote cache misses, several latency tolerating schemes have been proposed, including relaxed memory consistency models [7], prefetching [17], and multiple context processors [13, 22, 26]. Recent studies [8, 13] have shown multiple contexts to be a promising way to address the problem; this paper focuses on the multiple context solution. James Laudon is currently at Silicon Graphics, 2011 N. Shoreline Blvd., Mountain View, CA 94043. [Figure labels: Scalable Interconnection Network] Figure 1: ....
....remote memory accesses and interprocess synchronization. Most existing multiple context designs have targeted large-scale multiprocessors. Many applications running in this environment have substantial parallelism and application performance is often dominated by the large remote memory latency [8, 13]. In contrast, high performance commodity microprocessors primarily target the workstation environment. Parallelism is less abundant in workstation workloads, and may consist of running a large application in the background while editing, reading mail, video conferencing, or stressing ....
Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 254-263, May 1991.
....spend a significant amount of time on memory accesses. In fact, 8 out of the 13 programs spend more than half of their time stalled for memory accesses. 1.2 Memory Hierarchy Optimizations. Various hardware and software approaches to improve the memory performance have been proposed recently [15]. A promising technique to mitigate the impact of long cache miss penalties is software-controlled prefetching [5, 13, 16, 22, 23]. Software-controlled prefetching requires support from both hardware and software. The processor must provide a special prefetch instruction. The software uses this ....
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W-D. Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 254-263, May 1991.
No context found.
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.-D. Weber, "Comparative evaluation of latency reducing and tolerating techniques," Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 254-263, 1991.
No context found.
Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In The 18th Annual Int. Symp. on Computer Architecture, pages 254--263, 1991.
No context found.
Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques," in Proceedings of the 18th International Symposium on Computer Architecture, pp. 254-263, May 1991.
No context found.
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W-D. Weber. Comparative evaluation of latency reducing and tolerating techniques. In 18th International Symposium on Computer Architecture, pages 254--63, May 1991.
No context found.
Gupta, A., Hennessy, J., Gharachorloo, K., Mowry, T., and Weber, W.-D., "Comparative Evaluation of Latency Reducing and Tolerating Techniques", Proc. 18th International Symposium on Computer Architecture, May 1991.
No context found.
A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W.D. Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proceedings of the Eighteenth International Symposium on Computer Architecture, 1991.
No context found.
Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques," in Proceedings of the 18th Symposium on Computer Architecture, pp. 254-263, May 1991.
No context found.
Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 254--263, May 1991.
No context found.
Anoop Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W. D. Weber, "Comparative Evaluation of Latency Reducing and Tolerating Techniques", ACM Trans. of Computer, 1991.
No context found.
Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber. Comparative evaluation of latency reducing and tolerating techniques. In International Symposium on Computer Architecture, pages 254--263, May 1991.
No context found.
Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry, and Wolf-Dietrich Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pages 254--263, June 1991.