Download:
by Zhichun Zhu, Zhao Zhang, Xiaodong Zhang
University of Science of Technology
http://www.cs.wm.edu/hpcs/WWW/HTML/publications/./papers/TR-02-1.pdf
Abstract:
Configurations of contemporary DRAM memory systems are becoming increasingly complex. A recent study [5] shows that application performance is highly sensitive to the choice of configuration, and suggests that tuning burst sizes and channel configurations is an effective way to optimize DRAM performance for a given memory-intensive workload. However, this approach is workload dependent. In this study we show that, by using fine-grain priority access scheduling, we are able to find a workload-independent configuration that achieves optimal performance on a multichannel memory system. Our approach makes good use of the high concurrency and high bandwidth available on such memory systems, and effectively reduces the memory stall time of memory-intensive applications. Using execution-driven simulation of a 4-way issue, 2 GHz processor, we show that optimized fine-grain priority scheduling improves the average performance of fifteen memory-intensive SPEC2000 programs by about 13% and 8% on 2-channel and 4-channel Direct Rambus DRAM memory systems, respectively, compared with gang scheduling. Compared with burst scheduling, the average performance improvement is 16% and 14% for the 2-channel and 4-channel memory systems, respectively.
Citations
1253 | The SimpleScalar toolset, version 2.0 – Burger, Austin - 1997
356 | The MIPS R10000 superscalar microprocessor – Yeager - 1996
102 | Speculative precomputation: Long-range prefetching of delinquent loads – Collins, Wang, et al. - 2001
100 | Execution-based Prediction Using Speculative Slices – Zilles, Sohi - 2001
95 | Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors – Luk - 2001
88 | A bandwidth-efficient architecture for media processing – Rixner, Dally, et al. - 1998
67 | Memory-System Design Considerations for Dynamically-Scheduled Processors – Farkas, Chow, et al. - 1997
52 | Data prefetching by dependence graph precomputation – Annavaram, Patel, et al.
52 | Memory access scheduling – Rixner, Dally, et al. - 2000
50 | Data Prefetch Mechanisms – Vanderwiel, Lilja - 2000
49 | Reducing DRAM Latencies with an Integrated Memory Hierarchy Design – Lin - 2001
39 | The PowerPC 604 RISC microprocessor – Song, Denman, et al. - 1994
38 | Access ordering and memory-conscious cache utilization – Wulf - 1995
34 | Dynamically allocating processor resources between nearby and distant ILP – Balasubramonian, Dwarkadas, et al. - 2001
26 | Access Ordering and Effective Memory Bandwidth – Moyer - 1993
24 | Design of a parallel vector access unit for SDRAM memory systems – Mathew, McKee, et al. - 2000
23 | Access order and effective bandwidth for streams on a direct rambus memory – Hong, McKee, et al. - 1999
19 | A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality, Proc. 33rd Int’l Symp. Microarchitecture – Zhang, Zhu, et al. - 2000
18 | latency, or system overhead: Which has the largest impact on uniprocessor DRAM-system performance – Cuppu, Jacob, et al.