The Effectiveness of SRAM Network Caches in Clustered DSMs
1998
Cited by 17 (0 self)
The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off: hit ratio versus speed. Some recent commercial systems have opted for large and slow (S)DRAM network caches (NC), but others completely avoid them because of their damaging effects on the remote/local latency ratio. In this paper we will explore small and fast SRAM network caches as a means to reduce the remote stalls and capacity traffic of multiprocessor clusters. The major appeal of SRAM NCs is that they add less penalty on the latency of NC hits and remote accesses. Their small capacity can handle conflict misses and a limited amount of capacity misses. However, they can be coupled with main memory...
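The hit ratio versus speed trade-off can be made concrete with a simple average-latency model. The sketch below is not taken from the paper: the function names and all latency figures are hypothetical, and the model assumes that every network-cache (NC) miss pays a fixed lookup penalty on top of the remote access.

```python
def avg_remote_latency(hit_ratio, nc_hit_ns, nc_miss_penalty_ns, remote_ns):
    """Average latency of a remote reference that missed in the processor caches.

    Hits are served by the network cache; misses pay the NC lookup penalty
    plus the full remote access. All parameters are illustrative assumptions.
    """
    miss_ns = nc_miss_penalty_ns + remote_ns
    return hit_ratio * nc_hit_ns + (1.0 - hit_ratio) * miss_ns


def break_even_hit_ratio(nc_hit_ns, nc_miss_penalty_ns, remote_ns):
    """Hit ratio at which adding the NC neither helps nor hurts."""
    return nc_miss_penalty_ns / (remote_ns + nc_miss_penalty_ns - nc_hit_ns)


REMOTE_NS = 1000  # hypothetical remote access latency without any NC

# Hypothetical large, slow DRAM NC versus small, fast SRAM NC.
dram_break_even = break_even_hit_ratio(250, 120, REMOTE_NS)
sram_break_even = break_even_hit_ratio(80, 20, REMOTE_NS)

print(f"DRAM NC pays off above a hit ratio of {dram_break_even:.3f}")
print(f"SRAM NC pays off above a hit ratio of {sram_break_even:.3f}")
```

Under these assumed numbers the faster cache needs a much lower hit ratio before it starts to pay off, which is the appeal of a small SRAM NC described in the abstract; whether it wins overall still depends on the hit ratio it actually achieves.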
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP
IEEE Transactions on Parallel and Distributed Systems, 2000
Cited by 11 (2 self)
Exploiting cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and target-architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The execution of tasks in the partitions is scheduled in an adaptive and locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime ...
A Memory-Layout Oriented Runtime Technique for Locality Optimization
Proc. 1998 Int'l Conf. Parallel Processing (ICPP '98), 1998
Cited by 2 (1 self)
Exploiting locality at run-time is a complementary approach to compiler optimization for applications with dynamic memory access patterns. This paper proposes a memory-layout oriented approach to exploit cache locality for parallel loops at run-time on Symmetric Multi-Processor (SMP) systems. Guided by application-dependent hints and the targeted cache architecture, it reorganizes and partitions a parallel loop by shrinking and partitioning the memory-access space of the loop at run-time. In the generated task partitions, data sharing among partitions is minimized and data reuse within a partition is maximized. The execution of tasks in the partitions is scheduled in an adaptive and locality-preserving way to achieve balanced execution, minimizing the execution time of applications by trading off load balance and locality. Based on simulation and measurement, we show that our run-time approach achieves performance comparable to compiler optimizations for two applications whose load balance and cache locality can be well optimized by tiling and other program transformations. Our experimental results also show that our approach significantly improves memory performance for applications with dynamic memory access patterns. Programs of this type are usually hard for compilers to optimize.
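The general idea of partitioning by memory layout and then scheduling adaptively can be sketched as follows. This is only an illustration under simplifying assumptions, not the authors' algorithm: each task is reduced to a single base address, partitions are equal-sized contiguous address ranges, and the names `partition_by_address` and `next_task` are hypothetical.

```python
from collections import deque


def partition_by_address(tasks, num_caches):
    """Group tasks that touch nearby memory into one queue per cache.

    tasks: list of (task_id, base_address) pairs (a simplifying assumption).
    Sorting by address keeps data reuse inside a partition and keeps
    sharing between partitions low.
    """
    ordered = sorted(tasks, key=lambda t: t[1])
    chunk = -(-len(ordered) // num_caches)  # ceiling division
    return [deque(ordered[i * chunk:(i + 1) * chunk]) for i in range(num_caches)]


def next_task(queues, me):
    """Prefer local work; when idle, steal from the tail of the fullest queue.

    Taking from the tail leaves the victim's next tasks (at the head,
    adjacent in memory to its current work) undisturbed.
    """
    if queues[me]:
        return queues[me].popleft()
    victim = max(range(len(queues)), key=lambda q: len(queues[q]))
    if queues[victim]:
        return queues[victim].pop()
    return None


# Eight hypothetical loop iterations, 64 bytes apart, listed in reverse order.
demo_tasks = [(i, (7 - i) * 64) for i in range(8)]
demo_queues = partition_by_address(demo_tasks, num_caches=2)
```

Local execution preserves locality, while stealing only happens once a processor runs dry, which is one simple way to trade load balance against locality.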
Design and Analysis of Static Memory Management Policies for CC-NUMA Multiprocessors
1998
Cited by 1 (0 self)
The primary bottleneck of CC-NUMA architectures remains remote memory access, whose latency is several orders of magnitude higher than that of a local cache access. Designing effective data allocation policies that provide local memory data access and limit the need to access remote memories remains a challenge. We characterize the performance of three existing memory management techniques, namely, buddy, round-robin, and first-touch policies. With existing memory management schemes, we find several cases where requests from different processors arrive at the same memory simultaneously. This gives rise to bulky replies (in the form of data blocks) from the same memory. The cause of this bulkiness lies in scientific data whose sizes are powers of 2 being distributed over a number of memory modules that is also a power of 2. To alleviate this problem, we present two new memory management policies called skew-mapping and prime-mapping policies. By utilizing the properties of skewing and prim...
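The power-of-2 conflict described above is easy to reproduce with a toy model. The abstract is truncated and does not define skew-mapping or prime-mapping, so the sketch below substitutes the classic skewed-interleaving and prime-modulus schemes as stand-ins; the function names, the module count, and the access pattern are all assumptions.

```python
from collections import Counter


def round_robin(block, modules):
    """Plain interleaving: block b lives in module b mod M."""
    return block % modules


def skewed(block, modules):
    """Classic skewed interleaving (assumed stand-in): each successive
    row of M blocks is rotated by one more module."""
    return (block + block // modules) % modules


def prime_modulo(block, prime):
    """Classic prime-modulus mapping (assumed stand-in): only `prime`
    of the modules are used, so power-of-2 strides no longer collide."""
    return block % prime


def max_load(mapping, blocks):
    """Largest number of requested blocks that land in a single module."""
    return max(Counter(mapping(b) for b in blocks).values())


MODULES = 8  # power of 2, hypothetical
PRIME = 7    # largest prime not exceeding MODULES
# Sixteen blocks accessed with a stride equal to the module count.
strided_blocks = [i * MODULES for i in range(16)]

rr_load = max_load(lambda b: round_robin(b, MODULES), strided_blocks)
skew_load = max_load(lambda b: skewed(b, MODULES), strided_blocks)
prime_load = max_load(lambda b: prime_modulo(b, PRIME), strided_blocks)
```

With plain interleaving every strided block falls into the same module, whereas both alternative mappings spread the same requests across nearly all modules.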
High Performance Switch Architectures for CC-NUMA Multiprocessors
1999
High Performance Switch Architectures for CC-NUMA Multiprocessors. (August 1999) Ravishankar Iyer, B.S., Texas A&M University; M.S., Texas A&M University. Chair of Advisory Committee: Dr. Laxmi Bhuyan. Shared-memory multiprocessors are capable of providing significant performance benefits for scientific and commercial applications. Most recent multiprocessors employ the cache coherent non-uniform memory access (CC-NUMA) architecture. This dissertation focuses on various issues in CC-NUMA multiprocessor design and evaluation. Crossbar switches are excellent building blocks for designing high performance interconnection networks for CC-NUMA multiprocessors. In this dissertation, four switch design alternatives are presented for multistage interconnection networks (MINs). By modeling these switches in an execution-driven simulator, performance metrics such as average message latency, stall time and execution time are measured. Performance bottlenecks such as waiting delays and network in...
Design and Analysis of
In this paper, we characterize the performance of three existing memory management techniques, namely buddy, round-robin, and first-touch policies. With existing memory management schemes, we find several cases where requests from different processors arrive at the same memory simultaneously. To alleviate this problem, we present two improved memory management policies called skew-mapping and prime-mapping policies. By utilizing the properties of skewing and prime, the improved memory management designs considerably improve the application performance of cache coherent non-uniform memory access multiprocessors. We also re-evaluate the performance of a multistage interconnection network using these existing and improved memory management policies. Our results effectively present the performance benefits of different memory management techniques based on the sharing patterns of applications. Applications with a low degree of sharing benefit from the data locality provided by first-touch. However, several applications with significant sharing degrees as well as those with single processor initialization routines benefit highly from the intelligent distribution of data provided by skew-mapping and prime-mapping schemes. Improvements due to the new schemes are found to be as high as 35% in stall time.
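The contrast between first-touch locality and single-processor initialization can be illustrated with a small placement model. This is a hedged sketch rather than the paper's simulator: the node count, page count, access traces, and function names are all hypothetical, and plain round-robin stands in for a distributing policy.

```python
from collections import Counter


def first_touch(touches):
    """Home each page on the node that touches it first.

    touches: sequence of (node, page) pairs in program order.
    """
    home = {}
    for node, page in touches:
        home.setdefault(page, node)
    return home


def round_robin_pages(pages, nodes):
    """Home page p on node p mod N, regardless of who uses it."""
    return {p: p % nodes for p in pages}


def remote_fraction(home, accesses):
    """Fraction of accesses whose page is homed on a different node."""
    remote = sum(1 for node, page in accesses if home[page] != node)
    return remote / len(accesses)


def busiest_node_pages(home):
    """Number of pages homed on the most heavily loaded node."""
    return max(Counter(home.values()).values())


NODES, PAGES = 4, 16  # hypothetical machine and data size
# Compute phase with no sharing: node n works on pages 4n .. 4n+3.
compute = [(n, p) for n in range(NODES) for p in range(4 * n, 4 * n + 4)]
# Initialization routine run entirely by node 0.
serial_init = [(0, p) for p in range(PAGES)]

parallel_home = first_touch(compute)
serial_home = first_touch(serial_init + compute)
spread_home = round_robin_pages(range(PAGES), NODES)
```

In this toy setup, first-touch keeps every access local when each node initializes its own data, but homes every page on node 0 when one processor runs the initialization, while a distributing policy keeps the pages evenly spread over the nodes.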

