Abstract:
Prefetching is a widely used consumer-initiated mechanism to hide communication latency in shared-memory multiprocessors. However, prefetching is inapplicable or insufficient for some communication patterns such as irregular communication, pipelined loops, and synchronization. For these cases, a combination of two fine-grain, producer-initiated primitives (referred to as remote writes) is better able to reduce the latency of communication. This paper demonstrates experimentally that remote writes provide significant performance benefits in cache-coherent shared-memory multiprocessors with and without prefetching. Further, the combination of remote writes and prefetching is able to eliminate most of the memory system overhead in the applications, except misses due to cache conflicts.

